Abstract

Suppose we are given a time series or a signal $x(t)$ for $0 \le t \le T$. We consider
the problem of predicting the signal in the interval $T < t \le T + t_f$ from a knowledge
of its history and nothing more. We ask the following question: what is the largest
value of $t_f$ for which a prediction can be made? We show that the answer to this
question is contained in a fundamental result of information theory due to Wyner,
Ziv, Ornstein, and Weiss. In particular, for the class of chaotic signals, the upper
bound is $t_f \le \log_2 T / H$ in the limit $T \to \infty$, with $H$ being entropy in a sense that
is explained in the text.

If $|x(T - s) - x(t^* - s)|$ is small for $0 < s < \tau$, where $\tau$ is of the order of a
characteristic time scale, the pattern of events leading up to $t = T$ is similar to
the pattern of events leading up to $t = t^*$. It is reasonable to expect $x(t^* + t_f)$
to be a good predictor of $x(T + t_f)$. All existing methods for prediction use this
idea in some way or the other. Unfortunately, this intuitively reasonable idea
is fundamentally deficient and all existing methods fall well short of the
Wyner-Ziv entropy bound on $t_f$. An optimal predictor should decompose the distance
between the pattern of events leading up to $t = T$ and the pattern leading up to
$t = t^*$ into stable and unstable components. A good match should have suitably
small unstable components but will in general allow stable components which are
as large as the tolerance for correct prediction.

An optimal predictor for chaotic signals should have three properties. First, it 
should achieve the Wyner-Ziv entropy bound. Second, it should look something 
like the classical Wiener-Kolmogorov predictor when the signal has zero entropy 
and no positive Lyapunov exponents. Third, it must be numerically stable. For 
the special case of toral automorphisms, we use Padé approximants and derive
a predictor which has these properties and which seems to point the way to the 
derivation of a more general optimal predictor.

1 Introduction

Prediction of signals is a basic problem in science. In canonical problems of science such 
as planetary motion or the spectral theory of matter, the prediction problem is tackled 
by understanding the phenomenon that produces the signal and formulating a physical 
model for it. However, the need to predict signals remains in many instances where a 
physical model is unknown or is too complicated to be of practical use. 

The problem of predicting signals without deriving physical models was first formulated
independently by Kolmogorov and Wiener [23]. The Wiener-Kolmogorov predictors
are linear and assume that the signal is characterized solely by its auto-correlation
function. Under these assumptions, the predictors minimize the root mean square error 
and are optimal in that sense. A key assumption in the theory is that the signals are 
stationary. 

Chaotic signals, which are signals obtained from dynamical systems that exhibit
sensitive dependence on initial conditions, are stationary as well. However, the
characterization of these signals by their auto-correlation functions is unsatisfactory, and the
Wiener-Kolmogorov predictor should not be expected to perform well and in fact does not.
Available methods for the prediction of chaotic signals do not use spectral
decomposition. These methods are based on recurrence and embed the signal in phase space
in one way or another.

The contribution of this paper is three-fold. Firstly, we show that all current methods 
for predicting chaotic signals are necessarily suboptimal. Current predictors are lacking 
in an important respect, as we will show. Secondly, we make a firm and significant 
connection between the problem of predicting chaotic signals and a major result in 
modern information theory. Thirdly, we investigate ideas that point the way to the 
derivation of an optimal general purpose predictor for chaotic signals. 

The auto-correlation of a real valued signal $x(t)$ is defined as



The auto-correlation function is of little use in predicting a chaotic signal. However, 
some of the mechanics of the Wiener predictor, especially with regard to its use of 
the auto-correlation function, reappear in the Padé predictor that we will derive later.
The discussion of the Wiener predictor and its limitations in the simplified treatment of
Bode and Shannon [5] is illuminating and merits careful study. Bode and Shannon ask 
themselves how much is lost by the restriction to linear predictors and answer as follows: 

The fact that nonlinear effects may be important in a prediction can be 
illustrated by returning to the problem of forecasting tomorrow's weather. 
We are all familiar with the fact that the pattern of events over a period 
of time may be more important than the happenings taken individually in 
determining what will come. For example, the sequence of events in the 
passage of a cold or warm front is characteristic. Moreover, the significance
of a given happening may depend largely upon the intensity with which
it occurs. Thus, a sharp dip in the barometer may mean that moderately 
unpleasant weather is coming. Twice as great a drop in the same time, on 
the other hand, may not indicate that the weather will be merely twice as 
unpleasant; it may indicate a hurricane. 

One of the authors of this passage, who grew up near the 45th parallel where the annual 
snowfall is more than 10 feet, may have watched the barometer during the long and harsh 
winters. For our purpose, the significant point here is the emphasis on "the pattern of 
events." When the authors speak of the sequence of events in the passage of a cold or 
warm front being characteristic, it appears as if they had part of the intuition behind 
phase space reconstruction using delay coordinates. 

Many methods have been proposed for predicting chaotic signals. All the current
predictors known to us use delay coordinates. Suppose the real valued signal $x(t)$ is
obtained as $x(t) = b^T X(t)$, where $X(t)$ takes values in $\mathbb{R}^d$ and $b \in \mathbb{R}^d$ is constant.
Suppose $X(t)$ satisfies the dynamical system $\dot{X} = f(X)$. Delay coordinates are an
attempt to reconstruct the dynamics of $X$ in $\mathbb{R}^d$ using the scalar signal $x(t)$. Even
though an individual signal value such as $x(t_0)$ may give little idea of $X(t_0)$, the pattern
of events $x(t_0), x(t_0 - \tau), \ldots, x(t_0 - (k - 1)\tau)$ can be used to stand as a substitute for
$X(t_0)$ and to reconstruct dynamics in phase space for suitable values of the delay $\tau$ and
the embedding dimension $k$. The idea of using coordinates in this manner can
be traced back to some embedding theorems in topology, and we will have more to say
about this connection later.
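
The delay-coordinate construction can be sketched in a few lines for a uniformly sampled signal. This is only an illustration under stated assumptions: the function name `delay_vectors`, the integer `lag` (the delay $\tau$ measured in samples), and the sine test signal are hypothetical choices made for the example and do not come from the references.

```python
import numpy as np

def delay_vectors(x, k, lag):
    """Return rows (x[n], x[n-lag], ..., x[n-(k-1)*lag]) for every admissible n."""
    x = np.asarray(x, dtype=float)
    start = (k - 1) * lag  # first index that has a full delay history
    # Column p holds the signal delayed by p*lag samples.
    cols = [x[start - p * lag : len(x) - p * lag] for p in range(k)]
    return np.column_stack(cols)

# Hypothetical test signal: 50 samples of a sine wave.
signal = np.sin(0.1 * np.arange(50))
vectors = delay_vectors(signal, k=3, lag=5)
print(vectors.shape)
```

Each row plays the role of the pattern of events at one instant, so nearby rows correspond to stretches of the signal with similar recent histories.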

For the most part, our discussion of existing predictors is restricted to one of the 
most basic ones [H]. The prediction method in its most basic form is as good as any of 
its many variants as far as optimal prediction is concerned, as we argue in Section 5. We 
refer to this basic form of the predictor, which we describe in Section 5, as the embedding 
predictor for predicting chaotic signals. The embedding predictor matches the pattern 
of events at the present with some time in the past, and uses the best match to predict 
the future. In that respect, it stays close to the logic of Bode and Shannon, while using 
delay coordinates to obtain a concrete realization of their notion of the pattern of events. 

Perhaps the central point of this paper is that matching "the pattern of events" is 
not the best way to predict chaotic signals in spite of its indubitable intuitive appeal. 
This point is illustrated in Figure 1. Before explaining that figure, we set down some
notation that will be used throughout this paper. If the signal is $x(t)$, the current time
is always denoted by $T$. It is assumed that the signal is recorded from $t = 0$ and that the
stretch of signal that is available is $x(t)$ for $0 \le t \le T$. The task is to use the available
history, which is $x(t)$ for $0 \le t \le T$, to predict $x(T + t)$ for $0 < t \le t_f$ for as large a
value of $t_f$ as possible.

In the two plots of Figure 1, the thin black lines show a chaotic signal obtained from
the Lorenz system. The plots show only a part of the signal and $T$ is given as $2^{20}$ symbols.
Each symbol is equal to $t_{\mathrm{return}} = 0.7511$ units of time, where $t_{\mathrm{return}}$ is the average time
from one "turning point" to another. A turning point is defined as a peak or a trough of
the graph of $x(t)$, with only peaks or troughs with $|x(t)| > 6\sqrt{2}$ being counted. Since the
two fixed points that are located in the holes in the wings of the Lorenz attractor have
coordinates $(\pm 6\sqrt{2}, \pm 6\sqrt{2}, 27)$, turning points defined in this way are in correspondence
with intersections of the signal with Poincaré sections of the Lorenz attractor [22]. In
the plots, $T$ is given as $2^{20}$ symbols, which means $T = 2^{20} \times 0.7511$ and that the number
of turning points of $x(t)$ in $[0, T]$ is approximately $2^{20}$. In the plots, the thin black lines
go beyond $T$ to show how the signal develops so that we can visually assess the quality
of the predictions.

Figure 1: In both plots, the current time $T$ is $2^{20}$ symbols. The Lorenz signal, which is
shown as a thin black line, is the same in the two plots. The thick red line is: (a) best
fit from the past; (b) suboptimal prediction using the embedding predictor.
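
The definition of a turning point translates directly into a test on a sampled signal. The sketch below assumes uniform sampling and strict local extrema; the function name `turning_points` and the toy array are hypothetical, and for the Lorenz signal the threshold would be $6\sqrt{2}$.

```python
import numpy as np

def turning_points(x, threshold):
    """Indices of local peaks or troughs of x whose magnitude exceeds threshold."""
    x = np.asarray(x, dtype=float)
    mid = x[1:-1]
    peak = (mid > x[:-2]) & (mid > x[2:])
    trough = (mid < x[:-2]) & (mid < x[2:])
    keep = (peak | trough) & (np.abs(mid) > threshold)
    return np.flatnonzero(keep) + 1  # shift back to indices of x

# Toy signal: the small peak at index 5 is below the threshold and is not counted.
toy = np.array([0.0, 3.0, 0.0, -3.0, 0.0, 1.0, 0.0])
print(turning_points(toy, threshold=2.0))  # indices 1 and 3
```

Dividing the length of the record by the number of turning points found in this way gives an estimate of the average return time.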

The thick red lines in the two plots are obtained differently. In the top plot, we fix
a tolerance tol (the precise value of tol is unimportant for the discussion here) and look
for $t^* \in [0, T - t_{\mathrm{return}}]$ such that the length of fit, which is

$$\text{length of fit at } t^* = \text{largest } t_f \text{ such that } |x(T + s) - x(t^* + s)| < \mathrm{tol} \text{ for } s \in [0, t_f], \tag{1.2}$$

is maximized. The maximum value of the length of fit is denoted by $t_{\mathrm{best}}$. Here we are
looking into the future of the signal and trying to find the moment $t^*$ in the past which
agrees with the signal's future for the maximum period $t_f$ (within a specified tolerance).
This method of determining $t^*$ and the maximum length of fit $t_{\mathrm{best}}$ will be called the
best fit from the past. Since it looks at $x(T + s)$ for $s > 0$, the best fit from the past is
not a predictor.
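
A discrete-time version of the best fit from the past defined by (1.2) can be sketched as follows. The sketch assumes a uniformly sampled signal, measures the length of fit in samples, and uses an integer `gap` in place of $t_{\mathrm{return}}$; the function names and the periodic toy signal are hypothetical. Because it reads samples beyond $T$, it is a diagnostic and not a predictor.

```python
import numpy as np

def length_of_fit(x, T, t_star, tol):
    """Largest n with |x[T+s] - x[t_star+s]| < tol for s = 0, ..., n-1."""
    n = 0
    while T + n < len(x) and abs(x[T + n] - x[t_star + n]) < tol:
        n += 1
    return n

def best_fit_from_past(x, T, tol, gap):
    """Scan t* over 0, ..., T-gap and return (longest fit, first t* achieving it)."""
    fits = [length_of_fit(x, T, t_star, tol) for t_star in range(T - gap + 1)]
    t_star = int(np.argmax(fits))
    return fits[t_star], t_star

# Toy periodic signal: the future after T = 20 repeats the past exactly.
x = np.tile([0.0, 1.0, 2.0, 3.0], 10)
print(best_fit_from_past(x, T=20, tol=0.5, gap=1))  # (20, 0)
```

The brute-force scan costs on the order of $T$ times the fit length, which is acceptable for a sketch but not for long records.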

We see that the best fit from the past in Figure 1a follows the signal for $t > T$ for
more than 20 symbols. It is not difficult to see why no predictor can follow the signal for
longer. If we fit the signal starting at $x(t)$, $0 \le t \le T - t_{\mathrm{return}}$, to the signal starting at
$x(T)$, the fit will extend from $T$ to $T + t_f$ for some $t_f$ and then start diverging. The rate
of divergence beyond $T + t_f$ will be exponential as the signal is from a chaotic source.
By definition of $t_{\mathrm{best}}$, we have $t_f \le t_{\mathrm{best}}$. Thus the past has no information about what
happens to the signal beyond $T + t_{\mathrm{best}}$ and no amount of algorithmic legerdemain can
synthesize that information.

Thus it should be clear that the basic task of a predictor is to find $t^*$ which maximizes
(1.2), or another $t^*$ which nearly maximizes it, without looking into the future. Let us
see how the embedding predictor goes about predicting the signal. For the discussion
here, a brief account of the embedding predictor suffices. A more detailed discussion
including extensions and modifications of the basic predictor will be given later. The
embedding predictor works by finding $t^* \in [k\tau, T - t_{\mathrm{return}}]$ such that

$$\sum_{p=0}^{k-1} \left( x(T - p\tau) - x(t^* - p\tau) \right)^2 \tag{1.3}$$

is minimized. There is much literature about the choice of the delay parameter $\tau$ and
the embedding dimension $k$ (see [1] for instance). We will assume that $\tau$ and $k$ are
suitably chosen (with $\tau k$ about a fifth of a symbol). The prediction of $x(T + s)$ is taken
to be $x(t^* + s)$.
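
The basic embedding predictor can be sketched for a sampled signal as follows. This is a minimal illustration under assumptions, not the implementation of any cited method: the function name, the integer `lag` (the delay in samples), the integer `gap` (standing in for $t_{\mathrm{return}}$), and the convention that the squared distance runs over $k$ delayed samples are all choices made for the example. Only the history up to the current time is passed in, so the function cannot look into the future.

```python
import numpy as np

def embedding_predictor(history, k, lag, gap, horizon):
    """Match the last k delayed samples against the past; return (t*, prediction).

    The prediction of x(T + s) is x(t* + s) for s = 1, ..., horizon.
    Choose gap >= horizon so that the prediction stays inside the history.
    """
    h = np.asarray(history, dtype=float)
    T = len(h) - 1
    offsets = lag * np.arange(k)                   # p * tau for p = 0, ..., k-1
    present = h[T - offsets]                       # pattern leading up to T
    candidates = np.arange(k * lag, T - gap + 1)   # t* in [k*tau, T - gap]
    past = h[candidates[:, None] - offsets[None, :]]
    cost = np.sum((past - present) ** 2, axis=1)   # squared pattern distance
    t_star = int(candidates[np.argmin(cost)])
    return t_star, h[t_star + 1 : t_star + 1 + horizon]

# Toy periodic history: the pattern (3, 2) first recurs at index 3.
history = np.tile([0.0, 1.0, 2.0, 3.0], 6)
t_star, prediction = embedding_predictor(history, k=2, lag=1, gap=4, horizon=4)
print(t_star, prediction)
```

On a periodic toy signal the match is exact; on a chaotic signal the minimizer of the pattern distance need not be the time that gives the longest fit into the future.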



How well does $t^*$ which minimizes (1.3) work in terms of maximizing (1.2)? Before
answering that question, let us ask ourselves why there should be a connection at all
between finding $t^*$ to minimize (1.3) and finding it to maximize the length of fit defined
by (1.2). When we minimize (1.3), we are looking for a $t^*$ such that if we walk back from
$t = t^*$ the portion of the signal we see looks much like what we see when we walk back
from $t = T$. In other words, the pattern of events leading up to $t = t^*$ should look like
the pattern of events leading up to $t = T$. The hope is that if the events immediately
preceding $t = t^*$ look like the events immediately preceding $t = T$, the signal value
$x(t^* + s)$ will be a good predictor of $x(T + s)$.

Unfortunately, this intuitively reasonable idea is fundamentally deficient. To see
why, we go back to Figure 1. The thick red line of part (a) of that figure
is obtained by shifting $t^*$, which corresponds to the best fit from the past, to coincide
with $T$ to permit comparison between the two patterns. The thick red line of part (b)
is obtained by shifting $t^*$ found using the embedding predictor to $T$. In part (a), we see
that the sequences of events leading up to $t = T$ and $t = t^*$ are not close at all. Yet the
two portions of the signal slam into each other at $T$ and follow each other for more than
twenty symbols. In part (b), on the other hand, the sequences of events leading up to $t = T$
and $t = t^*$ are actually quite close. If we walk backwards, the pattern of events matches
for three symbols. Yet the fit into the future is not half as good as in part (a).

As we explain in Section 6 and later, the distance between the pattern of events
leading up to $t = T$ and the pattern of events leading up to $t = t^*$ must be decomposed
into stable and unstable components. For optimal prediction or for the best fit from the
past, the unstable components must be made suitably small by weighting with the
corresponding Lyapunov exponents, as will be explained. However, the stable components
will in general be as large as the tolerance for an acceptable fit or prediction allows.

If we split the distance between the pattern of events leading up to $t = T$ (black
line in Figure 1a) and the pattern of events leading up to $t = t^*$ for the best fit (thick
red line with $t^*$ shifted to $T$ in Figure 1a), the distance between the two patterns has
a noticeably substantial stable component but a small unstable component. However,
the stable component decreases exponentially fast beyond $t = T$, which means the two
signals slam into each other. The small unstable component allows the fit between the
two signals to persist for the longest interval of time.

In Figure 1b, on the other hand, both the stable and unstable components of the
distance between the two patterns are small, which means that the patterns of events
leading up to $t = T$ and $t = t^*$ are noticeably much closer. However, the unstable
component of the distance between the two patterns is not as small as in Figure 1a.
Therefore, when we advance beyond $t = T$ the red line diverges from the black line
much earlier.

The situation shown in Figure 1 is typical. Because of the nature of chaotic signals,
best fits tend to slam into the signal at $t = T$ and diverge rapidly beyond
$t = T + t_{\mathrm{best}}$. This introduces a fundamental asymmetry between the immediate past and the
immediate future. Good agreement in the immediate past does not imply that the two
portions of the signal will agree closely in the future.

Current predictors for predicting chaotic signals try to find a $t^*$ such that the pattern
of events leading up to $t = t^*$ closely resembles the pattern of events leading up to $t = T$.
If the goal is to predict the signal as far into the future as possible, that is not the right
idea. The right idea for an optimal predictor is to evaluate if the patterns of events
leading up to $t = t^*$ and $t = T$ are such that the two patterns will come close to each
other in the future and to calculate for approximately how long they will remain close.
Such a calculation requires us to decompose the distance between the two patterns into
stable and unstable components.

Our notion of optimality is the length of the fit into the future, which is defined
by (1.2). Another notion of optimality may be to predict $x(T + t_0)$ as accurately as
possible for a fixed $t_0$. The notion of optimality that we adopt exposes the basic aspects
of the prediction problem much better. Indeed, an optimal predictor according to our
notion of optimality, such as the one we derive for toral automorphisms, can be easily
modified to be optimal in the other sense. The basic point we make, which is the need
to decompose the distance between two patterns into stable and unstable components,
is valid no matter how the prediction problem is posed.

In our discussion here, we have stated that $t^*$ is found in the interval $[0, T - t_{\mathrm{return}}]$.
Since $t_{\mathrm{return}}$ is equal to a symbol and the fits can extend for more than 20 symbols, as is
evident from Figure 1, it may seem that the interval needs to be bounded away from $T$
by a greater amount. That is a minor point, because $t^*$ is nearly always smaller than $T$
by a large multiple of $t_{\mathrm{return}}$. If we want to be certain, we can take the interval for $t^*$ to
be $[0, T - 100\, t_{\mathrm{return}}]$.

Section 2 states a theorem of Wyner-Ziv [23] and Ornstein-Weiss [19] and Sections 3
and 4 develop the implications of the entropy bound in that theorem to the prediction
of chaotic signals. Heuristically, the theorem says that

$$\lim_{T \to \infty} \frac{t_{\mathrm{best}}}{\log_2 T} = \frac{1}{H}$$

with probability 1. Here $H$ is entropy in a sense that will be described. A predictor is
optimal if it predicts the signal in the interval $[T, T + t_f]$ and

$$\liminf_{T \to \infty} \frac{t_f}{\log_2 T} > \frac{1 - \epsilon}{H}$$

with probability 1 and for any $\epsilon > 0$. In Section 5, we discuss current predictors and
point out why they are necessarily suboptimal. From Section 6 onwards, we develop a
few ideas that take us closer to a general purpose optimal predictor for chaotic signals.

More introductory remarks can be made about the contents of the later sections. To 
bring this introduction to a close, we defer those remarks to later sections and point out 
a few aspects of the prediction problem that we do not consider. 

The signals considered here are assumed not to be noisy, although it is true that 
the signals are noisy in experimental situations. It is difficult to imagine how optimal 
predictors for noisy chaotic signals can be derived when they are unavailable for noiseless 
chaotic signals. If the underlying properties of chaotic signals which enable and limit 
prediction are not understood, it does not seem possible to put prediction theory of
noisy chaotic signals on a secure foundation. A penetrating investigation of the effect of
noise on chaotic signals is due to Lalley [17]. Lalley's algorithm D uses delay coordinates
much like many other investigations [12]. While proving the validity of algorithm
D as a method for removing the effect of noise, Lalley recognizes the need to require the 
width of the window used by the delay representation to increase at a sub-logarithmic 
rate. 

We limit ourselves to stationary signals and assume that no physical model of the 
source of the signal is available. For the possibilities created when something is known 
about the physical model of the source of the signals, one may consult the use of Bayesian 
models in the striking discoveries of Brown and others.

We limit ourselves to a single signal. The early work of Wiener [23] already considered 
ways to improve prediction when several signals are recorded perhaps at different points 
in space. As Wiener explained, given the tendency of weather to move east, Chicago
weather may be more important for predicting Boston weather than Boston weather
itself. We mention the intriguing work of Chernyshenko and Bondarenko [7] where it 
is shown that an entire turbulent velocity field can be recovered using only 2% of the 
modes. Remarks pertaining to spatially extended signals are found in the concluding 
section. For a study of the ergodic nature of turbulent events, see [TS]. 

2 Theorem of Wyner-Ziv and Ornstein-Weiss 

In this section, we describe three results that apply to stationary and ergodic sequences: 
the Poincare recurrence theorem, a theorem of Kac, and the entropy theorem of Wyner- 
Ziv and Ornstein-Weiss. Each of these results is pertinent to source coding and, as we 
will show, to the prediction of chaotic signals. 

The notion of stationarity can be defined for a sequence of random variables or for a
dynamical system. Since our interest is in the prediction of signals, we begin with the
definition for a sequence of random variables. A sequence of random variables

$$X_0, X_1, X_2, \ldots$$

is stationary if

$$\mathbb{P}\left((X_0, X_1, \ldots) \in B\right) = \mathbb{P}\left((X_k, X_{k+1}, \ldots) \in B\right), \quad k = 1, 2, \ldots,$$

for any Borel measurable subset $B$ of $\mathbb{R}^\infty$. The definition captures the idea that the
mechanism underlying the stochastic process does not change with time.

A stationary sequence is ergodic if every invariant event has probability 0 or 1. Events
phrased using means and correlations of the sequence are examples of invariant events.

For an alternative definition, let $T : \Omega \to \Omega$ be a measurable transformation that
preserves the probability measure $\mu$ on $\Omega$. The set $A \subset \Omega$ is invariant if $T^{-1}A = A$.
The transformation $T$ is ergodic if $\mu(A) = 0$ or $\mu(A) = 1$ for every invariant set $A$. The
ergodicity condition precludes the dynamics from getting stuck in a part of phase space.

The Poincaré recurrence theorem does not assume ergodicity.

Theorem 2.1 (Poincaré recurrence [IS]). Assume $X_0$ to be $\mu$-distributed and define
the stationary sequence $X_0, X_1, \ldots$ with $X_n = T^n(X_0)$ for $n = 1, 2, \ldots$ For a measurable
subset $B$ of $\Omega$ with $\mu(B) > 0$, $X_0 \in B$ implies $X_n \in B$ infinitely often with probability 1.

Suppose a long stream of text is modeled as a stationary sequence of characters and
suppose that the set $B$ is chosen to prescribe the first ten characters of the text. The
theorem then asserts that the sequence formed by the first ten characters will repeat
again and again. The origin of the sequence $X_0$ can be taken anywhere in the text.

If $\Omega$ is the phase space of a dynamical system, the theorem asserts that the dynamical
system will revisit the same region $B$ in phase space infinitely often. These revisitations
are the basis for predicting chaotic signals.

The Poincaré recurrence theorem is not quantitative. It does not tell us by what factor a long
stream of text can be compressed if the repetitions are exploited or how well a chaotic 
signal can be predicted by tracking the recurrences. The first step to a quantitative 
version of the Poincare recurrence theorem is a lovely theorem of Kac. This theorem 
assumes the sequence to be ergodic. 

Theorem 2.2 (Kac's theorem [IS]). Suppose that the sequence $X_0, X_1, \ldots$ is stationary
and ergodic. Let $B \subset \mathbb{R}$ with $\mathbb{P}(B) = P(X_0 \in B) > 0$. Let $n \ge 1$ be the smallest integer
such that $X_n \in B$. Then $E(n \mid X_0 \in B) = 1/\mathbb{P}(B)$.

Kac's theorem says that the expected time to return to the set B is exactly equal 
to the inverse of the probability of B. One would expect the recurrence time to sets of 
smaller probability to be greater. While the elegance of Kac's theorem may lead one to 
suspect that the theorem should be obvious or easy to demonstrate, a perusal of Kac's 
ingenious proof will dispel such a misperception. 
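
Kac's theorem can be checked numerically on a simple stationary ergodic source. The sketch below uses a two-state Markov chain chosen purely for illustration (the transition matrix, seed, and run length are assumptions, not taken from the text). Its stationary distribution assigns probability $1/6$ to state 1, so the mean return time to $B = \{1\}$ should be close to 6.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state chain; P[i, j] is the probability of moving from i to j.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
n_steps = 200_000

u = rng.random(n_steps)
x = np.empty(n_steps, dtype=int)
x[0] = 0  # the starting state does not affect long-run averages
for n in range(1, n_steps):
    x[n] = 1 if u[n] < P[x[n - 1], 1] else 0

visits = np.flatnonzero(x == 1)          # times at which the chain is in B = {1}
mean_return = np.diff(visits).mean()     # average gap between successive visits
prob_B = 1.0 / 6.0                       # stationary probability of state 1

print(mean_return, 1.0 / prob_B)
```

The average gap between visits estimates the conditional expectation in the theorem, and it agrees with $1/\mathbb{P}(B)$ up to sampling error.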

The entropy theorem stated below characterizes recurrences more sharply than Kac's 
theorem. It applies to sequences which are stationary, ergodic, and take values in a finite 
alphabet. The restriction to finite alphabets does not cause such a great loss of generality 
because information is fundamentally discrete in nature. Chaotic signals are real valued 
and often continuous in time. Yet we may obtain a notion of optimality of prediction of 
chaotic signals using the entropy theorem, as we will show in the following sections. 

Theorem 2.3 (Ornstein and Weiss [19]). Let $X_0, X_1, \ldots$ be a stationary and ergodic
sequence, in which each $X_n$ takes values in a finite alphabet $A$. Let $t_{\mathrm{best}}$ be the greatest
integer such that $X_{T+1}, \ldots, X_{T+t_{\mathrm{best}}}$ occurs as a contiguous subsequence of $X_0, \ldots, X_T$.
Then

$$\lim_{T \to \infty} \frac{t_{\mathrm{best}}}{\log_2 T} = \frac{1}{H}$$

with probability 1. Here $H$ is the entropy of the stationary, ergodic process $X_0, X_1, \ldots$



Theorem 2.2 tracks the re-occurrence of an event associated with $X_0$ for some $X_n$
with $n > 0$. Theorem 2.3 checks if an event that follows the current symbol $X_T$ repeats
a past event. We will refer to either scenario as a recurrence.

Theorem 2.3 is a remarkable sharpening of the Poincaré recurrence theorem. If we
regard $T$ as current time and assume that observations begin at 0, as we do throughout this
paper, it gives a perfect characterization of the extent to which the pattern that will
follow $T$ will repeat some pattern of events we have seen in the past. The $1/H$ bound was
first stated by Wyner and Ziv [21], who were able to prove the convergence of $t_{\mathrm{best}} / \log_2 T$
to $1/H$ in probability. Almost sure convergence of the type asserted by Theorem 2.3
was proved by Ornstein and Weiss [19].
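
The quantity $t_{\mathrm{best}}$ of Theorem 2.3 is easy to compute for a finite-alphabet sequence. The following sketch assumes the sequence is stored as a Python `bytes` object with one symbol per byte; the function name `t_best`, the fair-coin source (entropy 1 bit per symbol), the seed, and the cap of 64 future symbols are choices made for the example. Convergence of the ratio to $1/H$ is slow, so a single run only lands in the right neighborhood.

```python
import numpy as np

def t_best(seq, T):
    """Greatest t such that seq[T+1 : T+1+t] occurs contiguously in seq[0 : T+1]."""
    history = seq[: T + 1]
    future = seq[T + 1 :]
    t = 0
    # If a block of length t+1 occurs in the history, so does its prefix of
    # length t, so the match length can be extended one symbol at a time.
    while t < len(future) and future[: t + 1] in history:
        t += 1
    return t

print(t_best(b"abcabcabx", 4))  # history "abcab", future "cabx": longest match "cab"

# Fair coin flips, H = 1 bit per symbol, with 64 symbols of future available.
rng = np.random.default_rng(0)
T = 2 ** 16
bits = rng.integers(0, 2, size=T + 1 + 64, dtype=np.uint8).tobytes()
ratio = t_best(bits, T) / np.log2(T)
print(ratio)
```

For the coin-flip source the recurrence exists but carries no predictive information, which is the point taken up at the end of this section.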

The distinction between convergence in probability and almost sure convergence is 
pertinent to the prediction of chaotic signals. If predictions of weather or of hurricane 
tracks or of cardiac signals are to be really useful, the prediction method should apply 
to almost every signal and not only to a fraction of the signals that occur in practice. 
The distinction between almost sure predictions of individual signals and statistical 
predictability has not been made in extant work on the subject. Existing predictors of
chaotic signals have been validated generally with statistical notions of accuracy such
as mean square error and correlation plots. Our discussion of predictability of
chaotic signals will be framed in terms of almost sure predictability. 

Since $X_n$ takes values in a finite alphabet $A$ for $n \ge 0$, we refer to each value as a
symbol. The entropy $H$ is defined as follows. Suppose we consider the following block of
symbols of length $m$: $X_0, \ldots, X_{m-1}$. This block can take $|A|^m$ different values. Suppose
the probabilities of the different possibilities are $p_1, p_2, \ldots, p_M$, where $M = |A|^m$. Then

$$H = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{M} -p_i \log_2 p_i.$$

In dynamics, natural logarithms are used instead of logarithms to base 2. We will follow 
the information theory convention and use logarithms to base 2. 
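
The block-entropy definition can be evaluated exactly for a small source by enumerating all blocks. The sketch below uses a stationary two-state Markov chain as a stand-in source; the transition matrix and the function name `block_entropy_rate` are assumptions made for illustration. For such a chain the limit is known in closed form, so the convergence of the per-symbol block entropy can be observed directly.

```python
import itertools
import numpy as np

# Hypothetical two-state Markov source with its stationary distribution pi.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = np.array([5.0 / 6.0, 1.0 / 6.0])

def block_entropy_rate(m):
    """(1/m) * sum over all blocks of length m of -p * log2(p)."""
    total = 0.0
    for block in itertools.product((0, 1), repeat=m):
        p = pi[block[0]]
        for a, b in zip(block, block[1:]):
            p *= P[a, b]
        total -= p * np.log2(p)
    return total / m

# Closed-form entropy rate of the chain: sum_i pi_i * H(P[i, :]).
entropy_rate = -sum(pi[i] * P[i, j] * np.log2(P[i, j])
                    for i in range(2) for j in range(2))

for m in (1, 4, 12):
    print(m, block_entropy_rate(m), entropy_rate)
```

The per-symbol block entropy decreases toward the entropy rate as $m$ grows, with a gap that shrinks like $1/m$.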

The definition of entropy comes up in a natural way when we try to count states. 
Suppose we look at all $|A|^m$ possible values of the sequence $X_0, \ldots, X_{m-1}$. Some possible
sequences are more probable and some are less probable. How many possible sequences
have a probability that is approximately that of the average? The answer is $2^{mH}$.
The entropy theorem of Shannon and others asserts that a sufficiently long segment of
$X_0, X_1, \ldots$ looks like an average sequence with probability 1. Therefore to transmit $m$
symbols from such a stationary and ergodic source, we may be able to get by using $mH$
bits but no fewer. An optimal compression of the source will use $mH$ bits to encode $m$
symbols asymptotically.

Entropy comes up in statistical mechanics while counting the number of states of a
system. The interpretation of entropy in terms of information originated with Shannon's
source coding theorem. However, the coding scheme implicit in Shannon's theorem,
which is to use long block codes, is useless in practice. The widely used source coding
scheme of Lempel and Ziv relies on an entirely different interpretation of entropy, which
is the interpretation given by Theorem 2.3.

Theorem 2.3 interprets entropy in terms of the maximum segment following $X_T$ that
occurs as a subsequence of the segment preceding it. The entire segment following
$X_T$ can be encoded using a pointer to some place in the past and the length of the
recurrence. Various source coding schemes based on that idea have been derived by
Lempel, Ziv and others and have been widely used for decades. The distinction between
almost sure convergence and convergence in probability is important for the practical
success of these coding schemes.
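
The pointer-and-length idea can be illustrated with a toy greedy parser. This is a simplified sketch in the spirit of Lempel-Ziv coding and not any standardized variant: the phrase format `(start, length, next symbol)`, the restriction that matches lie wholly inside the already-seen history, and the function names are assumptions made for the example.

```python
def lz_parse(data):
    """Greedy parse into phrases (start of earlier match, match length, next symbol)."""
    phrases, i = [], 0
    while i < len(data):
        start, length = 0, 0
        # Extend the match while the longer segment still occurs entirely
        # inside the history data[0:i] and a next symbol remains to emit.
        while i + length < len(data) - 1:
            pos = data.find(data[i : i + length + 1], 0, i)
            if pos == -1:
                break
            start, length = pos, length + 1
        phrases.append((start, length, data[i + length : i + length + 1]))
        i += length + 1
    return phrases

def lz_unparse(phrases):
    """Rebuild the data by copying each referenced segment and appending the symbol."""
    out = b""
    for start, length, symbol in phrases:
        out += out[start : start + length] + symbol
    return out

data = b"abababbbabab"
phrases = lz_parse(data)
print(phrases)
print(lz_unparse(phrases) == data)
```

Long recurrences produce phrases that cover many symbols with a single pointer, which is how the recurrence length enters the compression rate.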

In normal use, entropy theorem refers to the entropy theorem of Shannon. In this
paper, entropy theorem and entropy bound will refer to Theorem 2.3. This convention
saves us the trouble of using four names every time we need to refer to the theorem and
the bound contained in it.

If we look at the entropy theorem in the light of prediction, it appears as if $\log_2 T / H$
symbols can be predicted using a history of length $T$. The fallacy behind that surmise
becomes evident if we consider an i.i.d. sequence made up of $\pm 1$, where each sign is
equally probable. The entropy of such a sequence is 1. As the entropy theorem asserts,
we may expect $\log_2 T$ symbols that follow a history of length $T$ to form a segment that
repeats a segment that has already occurred. That type of repetition is useless for
prediction. Given a knowledge of the history of the signal up to $X_T$, all that we know
about $X_{T+1}$ is that it is equally likely to be $+1$ or $-1$. Knowledge of history is useless
in the prediction of i.i.d. sequences.

Thus we need to precisely delineate the nature of chaotic signals which makes the 
entropy theorem relevant to their prediction. In Section 3, we describe the notion of 
entropy for chaotic signals, and in Section 4, we explain why the entropy theorem defines 
the limit of predictability of chaotic signals. At the end of Section 6, we describe what 
form an optimal predictor should take. While currently available predictors do not take
that form, in the rest of the paper, we describe a few ideas that suggest that optimal 
predictors can in fact be derived. 

3 Applicability of the entropy theorem to chaotic systems 

Stationary and ergodic sequences can be generated in many ways. An i.i.d. sequence
$X_0 = \pm 1, X_1 = \pm 1, \ldots$ with $p(+1) = p(-1) = 1/2$ is stationary and ergodic. Suppose
we form another sequence $Y_n$ with $Y_n = 1$ or $Y_n = -1$ according as $+1$ or $-1$ is the
majority among the seven entries $X_n, \ldots, X_{n+6}$. Then the $Y_n$ sequence is also stationary
and ergodic. Regardless of the length of history, neither the $X_n$ sequence nor the $Y_n$
sequence is predictable in the manner we consider. For notions of prediction pertinent 
to such signals, see PH] . 
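
The majority sequence $Y_n$ is simple to generate, which makes the example concrete. The sketch below draws the i.i.d. signs with NumPy; the seed and the sequence length are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.choice([-1, 1], size=1000)        # i.i.d. signs, each with probability 1/2

# Sum of each window X[n], ..., X[n+6]; a sum of seven odd terms is never zero.
window_sums = np.convolve(X, np.ones(7, dtype=int), mode="valid")
Y = np.sign(window_sums)                  # +1 or -1 according to the majority

print(len(Y), Y[:10])
```

Successive values of $Y_n$ share six of their seven inputs and are therefore correlated, yet each new value still depends on a fresh random sign.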

Suppose $X_{n+1} = f(X_n)$ is a dynamical system. The phase space of the dynamical
system can be any Riemannian manifold but for convenience we will assume it to be
a subset of $\mathbb{R}^d$. Let $\mu$ be a probability measure that is invariant with respect to the
dynamical system. If $X_0$ has $\mu$ as its distribution and $X_{n+1} = f(X_n)$ for $n = 0, 1, \ldots$,
the sequence $X_0, X_1, \ldots$ is stationary. If $\mu$ is indecomposable (an assumption we will
always make), the sequence is ergodic as well.

It is evident that a stationary and ergodic sequence $X_0, X_1, \ldots$ generated in this
manner is quite different from an i.i.d. sequence of the type $\pm 1, \pm 1, \ldots$ While the i.i.d.
sequence generates a random number for every new entry, in a stationary and ergodic
sequence derived from a dynamical system, every new entry is generated deterministically.

We do not assume the entire state vector $X_n$ to be observable. The observed sequence
is $x_0, x_1, \ldots$ where $x_n$ is a real-valued function of $X_n$. For example, $x_n$ can be some
component of $X_n$. This framework should be sufficiently general to allow for seismic
signals, ECG signals and so on. Nearly all the theoretical discussion will be restricted to
maps to avoid some of the technicalities that arise for flows. For both maps and flows,
the dynamical system that generates the signal is assumed to be unknown.

One of the examples we consider is a signal obtained from the Lorenz flow: 

$$\frac{dx}{dt} = 10(y - x), \qquad \frac{dy}{dt} = 28x - y - xz, \qquad \frac{dz}{dt} = -\frac{8z}{3} + xy.$$

The Lorenz system has fixed points at $(0, 0, 0)$ and $(\pm 6\sqrt{2}, \pm 6\sqrt{2}, 27)$. The two nonzero 
fixed points sit in the middle of the holes in the two wings of the butterfly-shaped 
attractor. The signal is generated by accurately integrating a random point $(x', y', z')$ 
for some time to generate the initial point $(x(0), y(0), z(0))$. The initial point generated 
in this way may be assumed to be $\mu$-distributed, with $\mu$ assumed to be the physical measure 
of the Lorenz attractor. The signal $x(t)$ is generated for $t \ge 0$ by integrating this initial 
point. For the purpose of prediction, it is assumed that the model which generates the 
signal is unknown. 

To apply the entropy theorem to the Lorenz signal $x(t)$, we need to specify the entropy 
of the Lorenz signal. We recall a few of the theoretical definitions related to the entropy 
of a dynamical system. For complete details, see [I^ or [25]. Let $f : \mathbb{R}^d \to \mathbb{R}^d$ be a 
smooth dynamical system and let $A$ be an invariant set. Let $\mu$ be a probability measure 
on $A$ that is invariant with respect to $f$. Assume that $f$ is an ergodic transformation of 
$A$ with respect to the measure $\mu$. In this setting, the definition of metric or Kolmogorov-Sinai 
entropy is quite simple. Let $\mathcal{P}$ be a finite partition of the set $A$. We can generate 
a finite-valued stationary ergodic process as follows. Pick $X_0$ according to $\mu$ and take 
$X_{n+1} = f(X_n)$ for $n = 0, 1, \ldots$ Let $Y_n$ be the element of the partition $\mathcal{P}$ that $X_n$ belongs to. Then 
the finite-valued process $Y_n$ is stationary and ergodic, and as such has a Shannon entropy 
which we denote by $h_\mu(f, \mathcal{P})$. In general, $h_\mu(f, \mathcal{P})$ can depend upon the partition $\mathcal{P}$. 
The metric entropy $h_\mu(f)$ is defined as the supremum of $h_\mu(f, \mathcal{P})$ over all finite partitions $\mathcal{P}$. 

At first sight, it might seem as if the entropy $h_\mu(f, \mathcal{P})$ can depend strongly upon $\mathcal{P}$. However, 
this dependence is not as severe as one might think. For example, one may modify 
$\mathcal{P}$ to the finer partition $\mathcal{P} \vee f^{-1}\mathcal{P}$, where the finer partition keeps track of the elements 
of $\mathcal{P}$ that $x$ and its iterate $f(x)$ belong to. Even though $\mathcal{P} \vee f^{-1}\mathcal{P}$ is a finer partition, 
$h_\mu(f, \mathcal{P} \vee f^{-1}\mathcal{P}) = h_\mu(f, \mathcal{P})$ because it is readily evident that combining the $n$-th and the 
$(n+1)$-st symbols into a single symbol in the $n$-th position will neither increase nor 
decrease the information per symbol. In fact, $h_\mu(f) = h_\mu(f, \mathcal{P})$ if the partition $\mathcal{P}$ is 
generating. Generating partitions are defined using conditional entropy [TB]. If the 
partitions $\mathcal{P} \vee f^{-1}\mathcal{P} \vee \cdots \vee f^{-n}\mathcal{P}$ become fine enough to closely approximate any given partition 
$\mathcal{Q}$ of $A$, the partition $\mathcal{P}$ is generating. 

log2 T | t_best | Matching sequence
       | 13     | AAAABAAABBAAA
       | 3      | BBB
       |        | ABBBABBAAAAAAABA
       | 17     | BBBABBBABABBAABAB
       |        | AAABBBAAABABAAABAAAA
       | 14     | AABAABAAAAAAAA
18     | 24     | BABBABBBABBBBABBABBABABB
       | 9      | BBBAABBAA
19     | 20     | BBBBABABBBAABAAAABAA
10     | 10     | BABABBBBAB
20     | 16     | ABBBAABABAAAABA
11     | 9      | ABBBAAAAA
21     | 30     | BABBBBAABABBAAAABBBBAAAAAABABB

Table 1: Recurrences of a Lorenz signal. 

Later the theoretical discussion will focus on hyperbolic attractors $A$. For such 
invariant sets, Markov partitions are generating. But now we will explain how the 
concept of metric entropy allows us to apply the entropy theorem to Lorenz signals. 

Table 1 shows a calculation of $t_{\mathrm{best}}$, in accord with its definition in the entropy 
theorem (Theorem 2.3), using a Lorenz signal. The symbols A and B have the following 
meaning. Every intersection of the Lorenz signal with the "quarter" plane 
$x < -6\sqrt{2}$, $y < -6\sqrt{2}$, $z = 27$ is recorded as the symbol A and every intersection 
with $x > 6\sqrt{2}$, $y > 6\sqrt{2}$, $z = 27$ is recorded as the symbol B. In this manner the Lorenz 
signal is turned into a stationary and ergodic sequence of As and Bs. For evidence that 
the partition into A and B is generating, see [22]. 
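The symbolic encoding just described can be sketched as follows. This is a minimal illustration, not the authors' procedure: the fixed-step fourth-order Runge-Kutta integrator, the step size, the transient length, the random starting box, and all function names are assumptions made for the example. A crossing of the plane $z = 27$ is detected by a sign change between consecutive steps and located by linear interpolation.

```python
import math
import random

def lorenz_rhs(x, y, z):
    """Right-hand side of the Lorenz equations with the standard parameters."""
    return 10.0 * (y - x), 28.0 * x - y - x * z, -8.0 * z / 3.0 + x * y

def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step of size dt."""
    x, y, z = state
    k1 = lorenz_rhs(x, y, z)
    k2 = lorenz_rhs(x + 0.5 * dt * k1[0], y + 0.5 * dt * k1[1], z + 0.5 * dt * k1[2])
    k3 = lorenz_rhs(x + 0.5 * dt * k2[0], y + 0.5 * dt * k2[1], z + 0.5 * dt * k2[2])
    k4 = lorenz_rhs(x + dt * k3[0], y + dt * k3[1], z + dt * k3[2])
    return tuple(s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def lorenz_symbols(t_total=200.0, dt=0.005, t_transient=20.0, seed=0):
    """Record A/B for crossings of the two quarter-planes in z = 27."""
    rng = random.Random(seed)
    state = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))  # random starting point
    for _ in range(int(t_transient / dt)):  # integrate for a while to settle on the attractor
        state = rk4_step(state, dt)
    r = 6.0 * math.sqrt(2.0)
    symbols = []
    for _ in range(int(t_total / dt)):
        new = rk4_step(state, dt)
        if (state[2] - 27.0) * (new[2] - 27.0) < 0.0:  # z - 27 changed sign
            w = (27.0 - state[2]) / (new[2] - state[2])  # linear interpolation weight
            xc = state[0] + w * (new[0] - state[0])
            yc = state[1] + w * (new[1] - state[1])
            if xc < -r and yc < -r:
                symbols.append("A")
            elif xc > r and yc > r:
                symbols.append("B")
        state = new
    return "".join(symbols)

symbols = lorenz_symbols()
print(len(symbols), symbols[:40])
print("average time per symbol:", 200.0 / len(symbols))
```

The printed average time per symbol can be compared with the value $t_{\mathrm{return}} = 0.7511$ quoted in the text; a short run with a fixed step is only expected to reproduce it roughly.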

A convenient way to estimate the entropy of the sequence of As and Bs is using 
Lyapunov exponents. Lyapunov exponents are the exponential rates with which infinitesimal 
perturbations to a point on $A$ grow or decay. For a definition, see [16]. The 
standard definition uses natural logarithms and not logarithms to base 2 as in the case 
of entropy. If the system is of dimension $d$, there are exactly $d$ Lyapunov exponents 
with repetitions. With probability 1 with respect to the measure $\mu$, these are the only 
possible rates of growth or decay. 

If the Lyapunov exponents are $\lambda_1, \ldots, \lambda_d$, the metric entropy satisfies 

$$h_\mu(f) \le \sum_{\lambda_i > 0} \lambda_i. \qquad (3.1)$$

This is Ruelle's inequality [23] (the same logarithm must be used in defining $h_\mu$ and the 
Lyapunov exponents $\lambda_i$). In some cases, equality holds in (3.1). 

For the Lorenz system, the continuous time Lyapunov exponent is approximately 
0.905 (using natural logarithms). The average time from an intersection with one of the 
quarter-planes A or B to another is $t_{\mathrm{return}} = 0.7511$. By Ruelle's inequality (3.1), the 
entropy of the sequences of As and Bs is bounded above by $0.905 \times 0.7511 / \log 2 = 0.98$. 
Very probably the entropy is quite close to 0.98. Table 1 appears to be in agreement 
with this estimate of the entropy. 

Table 2: Recurrences of flips of a fair coin. Columns: log2 T, t_best, matching sequence. Recoverable entries, in order: 19; 11; BABBABAABB; 21, 20, ABABBBBBBBBBABABBAAA. 
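The arithmetic behind the entropy estimate, and the prediction horizon $\log_2 T / H$ it implies, can be checked directly. The sketch below only reuses the two numbers quoted above; the variable names and the two sample values of $\log_2 T$ are illustrative choices.

```python
import math

lyapunov_per_time = 0.905  # positive Lyapunov exponent, natural logarithm units
t_return = 0.7511          # average time between successive symbols

# Ruelle's bound converted to bits per symbol
entropy_bound = lyapunov_per_time * t_return / math.log(2)
print(round(entropy_bound, 2))  # 0.98

# Horizon log2(T) / H in symbols, if H is taken equal to the bound
for log2_T in (11, 21):
    print(log2_T, round(log2_T / entropy_bound, 1))
```

With $H$ close to 1 bit per symbol, the expected length of the best match is only slightly larger than $\log_2 T$ itself.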



Table 2 tabulates $t_{\mathrm{best}}$ (defined as in Theorem 2.3) for tosses of a fair coin (with A 
for heads and B for tails). The entropy of the coin toss process is 1 and very close to 
the entropy of the Lorenz signal. Yet Table 2 looks quite different from Table 1. The 
fluctuations of $t_{\mathrm{best}}$ are more pronounced for the Lorenz signal. For the special case of 
i.i.d. sequences such as coin tosses, Theorem 2.3 was proved by Erdős and Rényi. 

The intersections with the quarter-planes A and B are recorded using the symbols 
A and B. For continuous time Lorenz signals $x(t)$, one may use the average time 
between symbols $t_{\mathrm{return}} = 0.7511$ as the unit. Following that usage, the values of the 
current time $T$ for the two plots in Figure 2 are reported as powers of 2 in units of symbols. 



When we think of the Lorenz signal as a sequence made up of the symbols A and 
B, it is natural to define $t_{\mathrm{best}}$ as in the entropy theorem (Theorem 2.3). However, for 
continuous time signals the definition of $t_{\mathrm{best}}$ which follows (1.2) is more natural. We 
take 

tol = 5 (3.2) 

to be the tolerance for Lorenz signals throughout this paper. Table 3 reports $t_{\mathrm{best}}$ with 
tol = 5. The $t_{\mathrm{best}}$ numbers with tol = 5 are somewhat smaller than the $t_{\mathrm{best}}$ numbers 
in Table 1. This is because tol = 5 is a stiffer requirement than simply requiring the 
symbol sequences to match. When other methods are compared to the best fits in Table 
3 later, the length of match is reported in symbols but not as a real number. 

4 Recurrence and predictability 

Suppose we are trying to predict a signal $x_0, \ldots, x_T$. The entropy theorem says that 
$t_{\mathrm{best}} \approx \log_2 T / H$ for large $T$. Thus it appears the past of the signal does not have 
sufficient information to predict $x_{T+t}$ for $t > \log_2 T / H$. We expect that no algorithm 
can predict $x_{T+t}$ for $t > (1 + \epsilon) \log_2 T / H$ for $\epsilon > 0$. In this section, we formalize this 
claim to some extent to bring out in outline what form the proof of such a claim might 
take. 

Figure 2: Best fits from the past (in thick red) to a Lorenz signal (in thin black), shown in panels (a) and (b) for two values of $T$ measured in symbols. The horizontal axes run from $T - 10$ to $T + 10$ in (a) and from $T - 15$ to $T + 15$ in (b). 

log2 T | t_best (symbols) | t_best (real)
       |                  | 12.41
17     | 14               | 10.54
19     | 18               | 13.22
10     |                  | 6.69
20     | 22               | 16.47
11     | 8                | 5.92
21     | 25               | 18.91

Table 3: Best fits to a Lorenz signal, where $t_{\mathrm{best}}$ in symbols equals $t_{\mathrm{best}}$ as a real number divided 
by $t_{\mathrm{return}} = 0.7511$. 

As in Section 3, the sequence $x_0, \ldots, x_T$ is assumed to be generated from the state 
vectors of a dynamical system $X_{n+1} = f(X_n)$. Since our aim is to upper bound the 
extent of predictability of the sequence, we may, without loss of generality, assume the 
entire state vector to be observable. We assume that the map $f$ possesses a hyperbolic 
attractor $A$. We assume that $f$ is transitive on $A$. Within a hyperbolic attractor, 
periodic points are dense and therefore a hyperbolic attractor satisfies the Axiom-A 
conditions. 

We define a predictor as a measurable function and write it as $P(X_0, \ldots, X_T) = 
(\hat{X}_{T+1}, \hat{X}_{T+2}, \ldots)$. The measurable function $P$ captures our notion of an algorithm which 
will take the successive state vectors $X_0, \ldots, X_T$ and will generate approximations 
$\hat{X}_{T+s}$ to $X_{T+s}$ for $s = 1, 2, \ldots$ The algorithm is not required to output an approximation 
for every $s > 0$. We will assume that it outputs approximations for $s = 1, 2, \ldots, t_f$. 

At this point, we have to decide when a prediction is termed valid. One possibility 
seems to be to use Markov partitions. Markov partitions use the stable and unstable 
manifolds of the hyperbolic set $A$ to partition $A$ into finitely many pieces. Markov 
partitions of arbitrarily small diameters are guaranteed to exist. The key advantage 
of Markov partitions is that they facilitate symbolic encoding of the dynamics of $f$ on 
the hyperbolic attractor $A$. The way the encoding works is similar to the encoding 
of Lorenz trajectories using the symbols A and B, which we explained in the previous 
section. However, defining the validity of predictions using Markov partitions seems 
problematic. Two points which are very close to each other can fall in different pieces 
of a Markov partition if they happen to lie on opposite sides of a boundary between pieces. 

Thus we go back to using a tolerance to define a valid prediction. A prediction $\hat{X}_{T+s}$ 
is deemed to be valid if $\|\hat{X}_{T+s} - X_{T+s}\| < \mathrm{tol}$ for some tolerance tol. 

We require the prediction algorithm to output $\hat{X}_{T+s}$ as a valid prediction for $s = 
1, \ldots, t_f$. In other words, each prediction output by the prediction algorithm $P$ must 
be valid. Alternatively, we can allow the prediction algorithm to output anything it 
wants and define $t_f$ by counting only the valid predictions in the segment that immediately 
follows $t = T$. At this point, there seems to be little to choose between the two 
possibilities. So we adopt the more restrictive definition of a prediction algorithm. 
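The alternative convention mentioned above, in which only the valid predictions immediately following $t = T$ are counted, can be sketched as follows. The example is for a scalar signal with the absolute value as the norm; the function name and the sample numbers are hypothetical.

```python
def valid_prediction_length(predicted, actual, tol):
    """Number of leading predictions whose error stays strictly below tol."""
    t_f = 0
    for p, a in zip(predicted, actual):
        if abs(p - a) >= tol:  # the first invalid prediction ends the count
            break
        t_f += 1
    return t_f

predicted = [1.0, 2.0, 3.0, 9.0, 5.0]
actual = [1.1, 2.2, 2.9, 4.0, 5.0]
print(valid_prediction_length(predicted, actual, tol=0.5))  # 3
```

For state vectors the absolute value would be replaced by a vector norm; nothing else changes.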

Now our claim can be stated as follows: if $P$ is a valid prediction algorithm, then 

$$\limsup_{T \to \infty} \frac{t_f}{\log_2 T} \le \frac{1 + \epsilon}{H} \qquad (4.1)$$

with probability 1 for any $\epsilon > 0$. The notion of entropy $H$ that we adopted in the 
previous section was metric entropy $h_\mu(f)$ relative to the physical measure $\mu$ on $A$. For 
a hyperbolic attractor, the physical measure is the SRB measure and it is guaranteed to 
exist. Thus we are assuming $X_0$ to be $\mu$-distributed, $X_1 = f(X_0)$, $X_2 = f(X_1)$, and so 
on. But it is the same prediction algorithm for any $f$ and any hyperbolic attractor $A$. 
It is not allowed to assume information about $f$. 

In order to explain why every prediction algorithm must satisfy the bound (4.1), we 
turn to another notion of entropy, namely topological entropy $h_{\mathrm{top}}(f)$ [IS]. To begin with 
we have a metric $d$ on $A$. We can define $d_n(x, y)$ to be the maximum of $d(f^i(x), f^i(y))$ 
over $i = 0, \ldots, n - 1$. If $N(\delta, n)$ is the number of $\delta$ balls required to cover $A$ in the 
metric $d_n$, topological entropy is defined using the relation $N(\delta, n) \approx C 2^{n h_{\mathrm{top}}}$ for small 
$\delta$. It is independent of the metric. In general, $h_\mu(f) \le h_{\mathrm{top}}(f)$ (see Theorem 4.5.3 of 
[IB]). With the assumptions we have made on $A$ and $\mu$, $h_\mu = h_{\mathrm{top}}$. 

Suppose we are given the sequence $X_0, \ldots, X_T$. That is equivalent to assuming that 
we know the iterates $f, f^2, \ldots, f^n$ at $T - n + 1$ points on $A$. For example, we know 
$f(X_1) = X_2$, $f^3(X_2) = X_5$, and so on. We are assuming $n$ to be of the order of $\log_2 T$. 
These $T - n + 1$ points on $A$ at which $n$ iterates of $f$ are known may be assumed to 
be approximately $\mu$-distributed. To predict $n$ iterates of $X_T$ with tolerance tol $= \delta$ 
from that information, we require one of the points $X_0, \ldots, X_{T-n}$ to be within $\delta$ of $X_T$ 
in the $d_{n+1}$ metric. For such a thing to be possible, we require $T - n > C 2^{n h_{\mathrm{top}}}$ or 
$n < \log_2 T / h_{\mathrm{top}}$ asymptotically. 

It may seem that one may extract some more information about $f^n(X_T)$ by clever 
interpolation of $f^n$ whose value is known at $X_0, \ldots, X_{T-n}$. It is true that clever 
interpolation can improve the accuracy dramatically if the function being interpolated is 
smooth. In this context, however, no such thing is possible even if $f$ is infinitely 
differentiable or real analytic. The key reason is that the exponential divergence of trajectories 
is enough to defeat any attempt at clever interpolation. 

Perhaps this point will be clearer with an example. The map $x_{n+1} = f(x_n)$ with 
$f(x) = 4x(1 - x)$ over the interval $[0, 1]$ has topological entropy equal to 1. Suppose we 
want to predict $f^n(x_T)$. Given the shape of $f$, $f^n$ will have $2^{n-1}$ oscillations. By an 
oscillation we mean a monotonic increase in $f^n(x)$ from 0 to 1 and then a monotonic 
decrease to 0. If $T < C 2^{n h_{\mathrm{top}}/(1+\epsilon)} = C 2^{n/(1+\epsilon)}$, it is clear that $T$ points will be too 
few to track all the oscillations of $f^n$. No interpolation scheme can make up for that 
kind of undersampling. The consideration here is somewhat analogous to that in the 
sampling theorem. According to the sampling theorem, to reconstruct a band-limited 
signal exactly, we must sample at least twice per wavelength. 
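The count of $2^{n-1}$ oscillations can be checked numerically by counting the local maxima of $f^n$ on a fine grid. The sketch below is an illustration only; the grid size and the function name are assumptions, and the grid must stay much finer than the narrowest oscillation for the count to be reliable.

```python
def count_oscillations(n, grid=20001):
    """Count local maxima of the n-th iterate of f(x) = 4x(1 - x) on [0, 1]."""
    ys = []
    for i in range(grid):
        x = i / (grid - 1)
        for _ in range(n):  # apply the map n times
            x = 4.0 * x * (1.0 - x)
        ys.append(x)
    # A grid point counts as a maximum if it rises from the left and does not rise to the right.
    return sum(1 for i in range(1, grid - 1)
               if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1])

for n in range(1, 7):
    print(n, count_oscillations(n), 2 ** (n - 1))  # the two counts should agree
```

Since the number of oscillations doubles with every iterate, the number of samples needed to resolve $f^n$ grows like $2^n$, which is the undersampling obstruction described above.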

As indicated earlier, the theoretical discussion in this section is restricted to maps. 
However, a new point comes up in relation to flows that is worth mentioning. Suppose 
we have a continuous signal $x(t)$ for $0 \le t \le T$ from a real analytic flow. Then $x(t)$ is 
analytic in a neighborhood of the real line. Thus in principle we may use the known 
stretch of the signal to predict it forever into the future using analytic continuation. Analytic 
continuation is numerically unstable and often not feasible as an extrapolation strategy. 
Limitations to the applicability of analytic continuation become evident the moment 
we note that the continuous signal must be sampled at some finite rate and that it is 
incorrect to assume the entire signal to be available. A very similar point comes up in 
the context of the Wiener-Kolmogorov predictor. See Section 1.7 of [23]. 

A prediction algorithm $P$ is optimal if 

$$\liminf_{T \to \infty} \frac{t_f}{\log_2 T} \ge \frac{1 - \epsilon}{H} \qquad (4.2)$$

with probability 1 for any $\epsilon > 0$. Our view of optimality is tied to almost sure prediction 
and not to statistical predictability. The practical significance of almost sure convergence 
is accepted in information theory. See the discussion in [21] for an example. 



5 The embedding predictor, related predictors, and their suboptimality 

The Wiener-Kolmogorov predictors were derived for statistical time series which are 
well-characterized by their auto-correlation functions. They should not be used for the 
extrapolation of smooth curves. In this regard, Wiener wrote [23, p. 71]: geometrical facts 
must be predicted geometrically and analytical facts analytically, leaving only statistical 
facts to be predicted statistically. 

There are two geometrical facts that are central to the prediction of chaotic signals. 
The first is recurrence and the second is the need to decompose close recurrences into 
stable and unstable components. Existing predictors have exploited recurrence but have 
not attempted to decompose close recurrences into stable and unstable components. As 
a result, they fall well short of being optimal in the sense of (4.2). 

In this section, we discuss a few existing predictors of chaotic signals. Some of the 
ideas used by existing predictors, which we find to be deficient with respect to optimal 
prediction, may become more useful once a good method is found to decompose close 
recurrences into stable and unstable components. For example, polynomial interpolation 
has been suggested and used for limited improvement of the accuracy of predictions of 
chaotic time series. It is of little use in getting closer to optimality. However, if close 
recurrences are decomposed appropriately into stable and unstable components, polynomial 
interpolation may indeed be useful for improving the accuracy of the prediction 
of $x(T + s)$, especially for $s < \alpha \log_2 T / H$, where $\alpha$ is a small fraction. 

Phase space reconstruction using delay coordinates is used by all existing predictors. 
We term the most basic of these predictors the embedding predictor E]. Given a 
signal $x(t)$ for $0 \le t \le T$, the embedding predictor finds $t^*$ to minimize (1.3), as we have 
already discussed. The key idea behind embedding predictors is to indirectly recover the 
location of the dynamical system in phase space at time $t$ using the delay coordinates 
$(x(t), x(t - \tau), \ldots, x(t - (k - 1)\tau))$. Suppose the state vector of the dynamical system 
at time $t = t_1$ is $X_1$ and the state vector at time $t = t_2$ is $X_2$. It is quite possible that 
$x(t_1) = x(t_2)$ even if $X_1 \ne X_2$, or that $|x(t_1) - x(t_2)|$ is small even if $X_1$ is not close to 
$X_2$. However, the pattern of events preceding $t = t_1$ and $t = t_2$ as recorded using delay 
coordinates gives us better information to decide if $X_1$ and $X_2$ are close to each other 
or not. 
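A minimal sketch of an embedding predictor for a uniformly sampled signal is given below. The metric (1.3) is not restated in this section, so the Euclidean distance between delay vectors is used as a stand-in; that choice, the function names, and the default delay and dimension are all assumptions made for illustration.

```python
import math

def delay_vector(x, t, delay, k):
    """Delay coordinates (x[t], x[t - delay], ..., x[t - (k - 1) * delay])."""
    return [x[t - j * delay] for j in range(k)]

def embedding_predict(x, T, steps, delay=5, k=3):
    """Predict x[T+1], ..., x[T+steps] by copying what followed the closest past recurrence."""
    target = delay_vector(x, T, delay, k)
    best_t, best_d = None, float("inf")
    # Candidates t* must leave room for `steps` known samples after them.
    for t in range((k - 1) * delay, T - steps + 1):
        d = math.dist(delay_vector(x, t, delay, k), target)
        if d < best_d:
            best_t, best_d = t, d
    return [x[best_t + s] for s in range(1, steps + 1)]

# Usage on a periodic test signal, where exact recurrences exist in the history.
signal = [math.sin(2.0 * math.pi * i / 50.0) for i in range(521)]
prediction = embedding_predict(signal[:501], T=500, steps=20)
print(max(abs(p - a) for p, a in zip(prediction, signal[501:521])))
```

On a periodic signal the closest recurrence is essentially exact, so the copied segment tracks the true continuation; on a chaotic signal the same procedure is the baseline whose shortcomings are discussed in this section.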

Although the choices of the delay parameter $\tau$ and the embedding dimension $k$ have 
been discussed extensively, it is difficult to make definite statements about what the 
best choices are. One approach is to use mutual information (see [I]). In this approach 
it is assumed that $\tau$ should not be too small because nearby values are well-correlated 
and not too large because distant points on the signal are very weakly correlated. Mutual 
information is used to find some kind of a compromise. Regarding the embedding 
dimension $k$, it is stated that it should be at least as large as the dimension of the 
underlying chaotic set. We comment about time series embedding using delay coordinates 
in Section 10. 

Figure 3: Suboptimal predictions (in thick red) using the embedding predictor to a 
Lorenz signal (in thin black), shown in panels (a) and (b) for two values of $T$ measured in symbols. 

Figure 3 shows suboptimal predictions of Lorenz signals using the embedding predictor 
and $\tau = 0.03$, $k = 5$. Our choice of the delay parameter at $\tau = 0.03$ is much smaller 
than what the mutual information criterion would imply. The mutual information criterion 
would imply a $\tau$ that is large enough to span a few oscillations of the signal. It 
is difficult to see what advantage using information from such distant points may have 
with regard to prediction, where the game is to exploit local information optimally. Indeed, 
use of a larger delay parameter gives no improvement at all. Some of the extant 
discussion about choosing the delay parameter appears to be based on a desire to obtain 
good plots and not good predictions. 

For a study of the effect of the delay parameter on the quality of prediction, see 
Figure 22 of Casdagli et al.jn]. For the Ikeda map, the optimal delay for prediction 
is found to be the smallest delay possible. In Figure 22 of that paper, an attempt is 
made to predict only one iteration using a history that is equal to 10^ iterates in length. 
The entropy theorem indicates that more than 20 iterates could be predictable using a 
history of that length. Here the advantage of defining optimal prediction as in (4.2), 
which we mentioned earlier in the introduction, becomes evident in a more concrete 
way. If we attempt to predict only one iterate, different prediction methods will differ in 
terms of accuracy, but the difference will be quite delicate. Even for predicting a single 
iterate optimally, it is important to resolve close recurrences into stable and unstable 
components. However, the gain in accuracy to be obtained by resolving close recurrences 
in that manner is not easily noticed. In contrast, the optimality criterion (4.2) which 
emphasizes the length of the fit into the future, exposes the central deficiency of existing 
predictors in a way that is quite easy to see. 

If we compare Figure 3 with Figure 2, it is abundantly clear that the embedding 
predictor does not extract the information in the history of the signal in an optimal 
manner. The embedding predictor gives a closer fit in the immediate past of $t = T$, 
but that is precisely why it does not do the best job of predicting the future. Still, from 
Figure 3, we see that the fit into the future is much better than the fit into the past. 
Does the embedding method have a bias to the future after all? The answer is no. The 
embedding method treats the past and the future equally. There is nothing in it to say 
that it is attempting to predict the future rather than fit the past. The better fit into 
the future we see in the figure is a consequence of the Lyapunov exponents of the Lorenz 
attractor. The lone negative exponent of the Lorenz attractor is $-14.5$ (using natural 
logarithms) and is much larger in magnitude than the lone positive exponent, which is 
0.905. Therefore if we pick two points close to each other on the Lorenz attractor, the 
corresponding trajectories will typically diverge faster in the past. 

The $t_{\mathrm{embed}}$ column of Table 4 is obtained as follows. The metric (1.3) is used to pick 
$t^*$ so that the distance between the delay coordinates at $t = t^*$ and $t = T$ is the smallest. 
The length of the fit into the future is given by $t_{\mathrm{embed}}$: $|x(t^* + s) - x(T + s)| < \mathrm{tol}$ for 
$0 \le s \le t_{\mathrm{embed}}$ but not for $0 \le s \le t$ with $t > t_{\mathrm{embed}}$. Comparison of $t_{\mathrm{embed}}$ in Table 4 
with $t_{\mathrm{best}}$ in Table 3 shows that the embedding predictor does not approach optimality. 

Table 4: Length of suboptimal predictions of a Lorenz signal ($t_{\mathrm{embed}}$) using the embedding 
predictor. 

In the rest of this section, we consider a number of ideas for improving the basic 
embedding predictor. All these ideas have merits. However, to be fully effective, they 
need to take into account an essential aspect of chaotic signals, which is their tendency 
to separate or come together depending upon the relative sizes of the stable and unstable 
components. 

The first idea we mention is from the paper by Farmer and Sidorowich [5]. To 
predict $x(T + s)$ the basic embedding predictor picks a single $t^* \in [k\tau, T - s]$ using the 
metric (1.3). Instead, a predictor may pick $p$ different instants $t_1^*, \ldots, t_p^*$ where the delay 
coordinates are the $p$ closest to the delay coordinates at $t = T$. Assuming $p > k$, the 
prediction of $x(T + s)$ is generated as a linear combination of the delay coordinates at 
$t = T$ by fitting $x(t_i^* + s)$ as a linear combination of the delay coordinates at $t = t_i^*$, for 
$i = 1, \ldots, p$, using linear least squares. 
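The least squares step described above can be sketched as follows for a uniformly sampled signal. This is a simplified illustration rather than the method of the cited paper: the Euclidean distance stands in for the metric (1.3), the fit has no constant term, the least squares problem is solved through its normal equations with a small hand-written solver, and all names and default parameters are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a small square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z

def local_linear_predict(x, T, s, p, delay=1, k=2):
    """Predict x[T+s] from the p past delay vectors closest to the delay vector at T."""
    vec = lambda t: [x[t - j * delay] for j in range(k)]
    target = vec(T)
    neighbors = sorted(range((k - 1) * delay, T - s + 1),
                       key=lambda t: math.dist(vec(t), target))[:p]
    rows = [vec(t) for t in neighbors]
    rhs = [x[t + s] for t in neighbors]
    # Normal equations (A^T A) w = A^T y of the linear least squares problem
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, rhs)) for i in range(k)]
    w = solve_linear(ata, aty)
    return sum(wi * ti for wi, ti in zip(w, target))

# Usage on a sampled sinusoid, for which the linear relation is exact.
signal = [math.sin(i) for i in range(504)]
print(local_linear_predict(signal[:501], T=500, s=3, p=6), signal[503])
```

For a sinusoid the value $s$ steps ahead is an exact linear combination of two delay coordinates, so the fit reproduces it to rounding error. For a chaotic signal the fit is only locally valid, which is the limitation taken up in the paragraphs that follow.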

Let us first understand the merit of this idea. Suppose we are looking at a Lorenz 
signal and we fix $s = 1$, which means we are trying to predict the signal at a point that 
is somewhat more than one return time ($t_{\mathrm{return}} = 0.7511$) from $t = T$. For sufficiently 
large $T$, the signal will have delay coordinates at $t = t_i^*$ close to those at $t = T$ for each 
of the $p$ values of $i$. More importantly, they will be sufficiently close that none of the $p$ 
segments $x(t)$, $t_i^* \le t \le t_i^* + s$, will diverge from each other for $i = 1, \ldots, p$. Therefore 
extrapolation using least squares will improve the order of accuracy (see Figure 2 of [H]). 

The situation is quite different if we take $s = \alpha \log_2 T / H$, with say $\alpha = 0.75$. In 
this case, we want to predict an instant that gets farther out in time as $T$ increases. In 
this situation the $p$ segments $x(t)$, $t_i^* \le t \le t_i^* + s$, with $i = 1, \ldots, p$ will diverge from 
each other with high probability, ruining any attempt to extrapolate using linear least 
squares. One may attempt to patch the situation by trying to classify the $p$ segments into 
clusters that stay close to each other and then picking one of the clusters to extrapolate 
from $t = T$ to $t = T + s$. But to do so would be to get back to our point that one 
has to decompose the distance between segments of the signal into stable and unstable 
components for optimal prediction. 

Even with s = 1, in which case extrapolation using least squares improves the accu- 
racy of the basic embedding predictor, there are advantages to decomposing the distance 
between segments of the signal into stable and unstable components. Such a decompo- 
sition will allow us to weight the different segments from the past and wring all the 
information out of the signal. Conversely, ideas such as extrapolating using linear least 
squares may be useful once the basic issue of resolving the distance between segments 
into stable and unstable components is addressed. 

Other ideas for improving the basic embedding predictor are to use higher order 
polynomials for extrapolation [9], to trap the delay coordinates at $t = T$ within a simplex 
in reconstructed phase space [20], or to weight close recurrences using the closeness of 
the approach pQ. The merits and demerits of these ideas are as in the discussion above 
and nothing more needs to be said. Another idea is to extrapolate from $t = T$ to 
$t = T + 1$ using the embedding predictor, possibly with enhancements, and then iterate 
the extrapolation from $t = T$ to $t = T + 1$ a total of $s$ times to extrapolate from $t = T$ 
to $t = T + s$. The merit of this idea is to bring in new information from the signal to 
evaluate intermediate points such as $t = T + 1$ and $t = T + 2$. However, the embedding 
predictor continues to be suboptimal even with this enhancement. The problem is that a 
single step of extrapolation will throw away all the information about stable and unstable 
manifolds in the vicinity of $t = T$. The way the stable and unstable components of the 
distance between two segments of the signal must be taken into account depends upon 
how far into the future we want to extrapolate, as will become clear in the next section. 

6 Character of an optimal predictor 

In this section, we give a sense of how an optimal predictor might work. Although a 
general purpose optimal predictor has not yet been derived, it is possible to give a sense 
of what such a predictor should do. 

Suppose $c$ is a fixed point of the map $f$. The iterates at $c$ will obviously look like 

$c, c, c, \ldots$ 

Suppose we pick a point $X_0$ within a distance $\epsilon$ of $c$ and look at the sequence 

$X_0, f(X_0), f^2(X_0), \ldots$ 

When is the latter sequence closest to the former sequence? The answer is they are 
closest when $X_0$ lies on the stable manifold of $c$. If it lies on the unstable manifold of 
$c$, on the other hand, the latter sequence will quickly diverge from the former. Here we 
already see the basic ingredient for optimal prediction. For a good match between the 
sequences, it is not enough to pick $X_0$ close to $c$ but we have to pick $X_0$ to be on or close 
to the stable manifold of $c$. An optimal predictor has to implement this idea using time 
series data and nothing more. 
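The contrast between the two choices of $X_0$ is easy to see on a toy example. The sketch below uses a linear map of the plane with a hyperbolic fixed point at the origin; the map, the value of $\epsilon$, and the function names are assumptions chosen only to make the arithmetic transparent.

```python
import math

def f(point):
    """Linear hyperbolic map with fixed point c = (0, 0): expands x, contracts y."""
    x, y = point
    return 2.0 * x, 0.5 * y

def distances_from_fixed_point(point, n):
    """Distances |f^j(point) - c| for j = 0, ..., n, with c = (0, 0)."""
    out = []
    for _ in range(n + 1):
        out.append(math.hypot(*point))
        point = f(point)
    return out

eps = 1e-3
on_stable = distances_from_fixed_point((0.0, eps), 10)    # X_0 on the stable manifold (y axis)
on_unstable = distances_from_fixed_point((eps, 0.0), 10)  # X_0 on the unstable manifold (x axis)
print(on_stable[-1], on_unstable[-1])  # shrinks by 2**10 versus grows by 2**10
```

Both starting points are at the same distance $\epsilon$ from $c$, yet after ten iterates one orbit is about a million times closer to $c$ than the other.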
In general, it is impossible to pick a point that is exactly on the stable manifold. 
Therefore, we expand upon what it means to pick a point that is close to the stable 
manifold. Let $c$ be a point on the hyperbolic attractor. Let us suppose that $x$ is close 
enough to $c$ and that we may write 

$$x = c + \sum_{i=1}^{u} a_i v_i^+(c) + \sum_{i=1}^{s} b_i v_i^-(c). \qquad (6.1)$$

Here the $v_i^+(x)$ are unit vectors in the tangent space at $x$ corresponding to positive Lyapunov 
exponents and the $v_i^-(x)$ are unit vectors corresponding to negative Lyapunov 
exponents. For simplicity, we assume the Lyapunov exponents to be distinct with $u$ 
positive exponents and $s$ negative exponents. Let $\lambda_i^+$ be the characteristic multiplier 
corresponding to $v_i^+$ and similarly let $\lambda_i^-$ correspond to $v_i^-$ (if $l$ is a Lyapunov exponent 
defined using natural logarithms, $\exp(l)$ is the corresponding characteristic multiplier). 
We have 

$$f^n(x) \approx c_n + \sum_{i=1}^{u} a_i (\lambda_i^+)^n v_i^+(c_n) + \sum_{i=1}^{s} b_i (\lambda_i^-)^n v_i^-(c_n), \quad \text{where } c_n = f^n(c). \qquad (6.2)$$

Here we have assumed that the expansion along the directions $v_i^+$ and $v_i^-$ is by the 
same factor with each iteration. With that assumption, it is easier to bring out the 
essential aspects of the heuristic argument we are developing here. Note that 
$|\lambda_i^+| > 1$ and $|\lambda_i^-| < 1$. 

To eliminate some linear algebra from the discussion, we will assume that $v_i^+(x)$, 
$1 \le i \le u$, and $v_i^-(x)$, $1 \le i \le s$, form an orthonormal basis for the tangent space at 
each point $x$ on the hyperbolic attractor. For the related concepts of adapted metric 
and adapted coordinates, see [TB]. 

Suppose (as usual) that the points in the available trajectory are $x_0, \ldots, x_T$ with 
$x_T = c$. To predict the sequence $f(c), f^2(c), \ldots, f^k(c)$, with $k \approx \log_2 T / H$, we will look 
at points from the sequence $x_0, \ldots, x_{T-k}$ that are close enough to $x_T = c$ and can be 
represented in the form (6.1). Here we will examine what kind of points $x$ are available 
in the sequence and which ones will be useful predictors. 



Let us try to find an $x$ of the form (6.1) in the available history with $a_i = A_i \delta$ for 
$1 \le i \le u$ and $b_i = B_i \delta$ for $1 \le i \le s$, with $A_i$ and $B_i$ fixed to determine the shape of the 
box around $c$ and with as small a $\delta$ as possible. Kac's theorem (Theorem 2.2) suggests 
that we may find a point in the available history in a box around $c$ if the volume of the 
box is $1/T$ or more. Thus in a box of shape determined by $A_i$ and $B_i$, the smallest $\delta$ 
that leaves the box large enough to be likely to include a point from the available history 
is given by $A_1 \cdots A_u B_1 \cdots B_s \delta^{u+s} \approx 1/T$. In fact, we will allow the stable components 
$b_i$ to be as large as the tolerance allows. In that case, the box has dimensions $a_i = A_i \delta$ 
and $b_i = O(1)$. The smallest $\delta$ should then satisfy 

$$A_1 \cdots A_u \delta^u = \frac{C}{T} \qquad (6.3)$$

for some constant $C$. 

We may now try to choose the shape of the box to allow $f^n(x)$ to stay close to $f^n(c)$ 
for $n = 1, \ldots, k$. If we estimate the distance between $f^n(x)$ and $f^n(c)$ using (6.2), the 
distance comes out as follows: 

$$\|f^n(x) - f^n(c)\|^2 \approx \sum_{i=1}^{u} A_i^2 \delta^2 (\lambda_i^+)^{2n}. \qquad (6.4)$$

Here we have neglected the stable components because $|\lambda_i^-| < 1$ and these stable 
components diminish rapidly with $n$. As long as the stable components are less than a 
tolerance, we do not need to worry about them. Given the constraint on how small the 
box can get, the best shape is obtained by taking $A_i = 1/(\lambda_i^+)^n$. The value of $\delta$ implied 
by (6.3) with this choice of $A_i$ is 

$$\delta^u = \frac{C (\lambda_1^+ \cdots \lambda_u^+)^n}{T} \qquad (6.5)$$

and the minimum possible value of $\|f^n(x) - f^n(c)\|$ is approximately $\delta\sqrt{u}$. 

From this heuristic calculation, we learn two things. If we want to pick an $x$ from 
the available history to minimize $\|f^n(x) - f^n(c)\|$ it is not enough to simply pick an 
$x$ from the history that is as close to $c$ as possible. We have to balance the sizes of 
the unstable components $a_i$ carefully. The stable components $b_i$ can be as large as the 
tolerance of the problem allows, which means that the best $x$ for predicting $f^n(c)$ may 
not be particularly close to $c$. 

For valid prediction of $f^n(c)$, we require $\|f^n(x) - f^n(c)\| \approx \delta\sqrt{u} < \mathrm{tol}$. If we use
expression (6.5) for $\delta$, we get

$$n < \frac{\log_2 T + u \log_2 \mathrm{tol} - (u/2)\log_2 u - \log_2 C}{\sum_{i=1}^{u} \log_2 |\lambda_i|}. \qquad (6.6)$$

For a hyperbolic attractor, metric entropy is equal to $\sum_{i=1}^{u} \log_2 |\lambda_i|$. From this calculation,
we understand why the metric entropy shows up the way it does in the entropy theorem.
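
The bound (6.6) is simple to evaluate. The following is a minimal sketch, assuming that (6.6) reads $n < (\log_2 T + u \log_2 \mathrm{tol} - (u/2)\log_2 u - \log_2 C)/\sum_i \log_2|\lambda_i|$. The function name `prediction_horizon_bound` and the sample values ($T = 2^{20}$, $C = 1$) are illustrative assumptions and are not taken from the computations reported in this paper.

```python
import math

def prediction_horizon_bound(T, unstable_eigenvalues, tol, C=1.0):
    """Heuristic upper bound on the number n of predictable iterates.

    Evaluates (log2 T + u log2 tol - (u/2) log2 u - log2 C) / sum_i log2 |lambda_i|,
    where u is the number of unstable directions.
    """
    u = len(unstable_eigenvalues)
    # Metric entropy of a hyperbolic attractor in bits per iterate.
    entropy = sum(math.log2(abs(lam)) for lam in unstable_eigenvalues)
    numerator = (math.log2(T) + u * math.log2(tol)
                 - 0.5 * u * math.log2(u) - math.log2(C))
    return numerator / entropy

# Assumed sample values: history length 2**20, a single unstable eigenvalue
# 2.61803 (the value quoted later for automorphism (8.1)), tol = 0.1, C = 1.
bound = prediction_horizon_bound(2**20, [2.61803], tol=0.1)
bound_doubled = prediction_horizon_bound(2**21, [2.61803], tol=0.1)
print(f"n < {bound:.2f}")
```

Under these assumptions, doubling $T$ raises the bound by exactly $1/H$ iterates, where $H$ is the entropy in bits, which is the logarithmic dependence on $T$ expressed by the entropy theorem.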



In the argument leading up to (6.6), we assumed $x$ and $c$ to be points on the
hyperbolic attractor. A predictor which predicts $x_{T+n}$ for $n$ that approaches the upper
bound in (6.6) or is optimal in the sense of (4.2) has to calculate the $a_i$ in (6.1) using
time series data alone.

Given a Lorenz signal, suppose we want to assess if $t = t^*$ will give a long fit to
the segment following $x(T)$, with the length of fit defined as in (1.2). If we knew the
points $X(t^*)$ and $X(T)$ in the three-dimensional phase space of the Lorenz flow, as well
as the decomposition $X(t^*) - X(T) = s + f + u$, where $s$ is along the stable direction
at $X(T)$, $f$ is along the flow at $X(T)$, and $u$ is along the unstable direction at the same
point, the assessment would be easy to make. As long as the components $f$ and $s$ are
below the tolerance, we want the minimum possible $u$ for the longest fit.

The embedding method attempts to estimate the distance between $X(t^*)$ and $X(T)$
using delay coordinates and the formula (1.3). It does not even attempt to resolve the
close recurrences into $s$, $f$, and $u$ components as an optimal predictor should.






7 The Wiener-Kolmogorov predictor 

Let $x(t)$ be a signal that arises from a continuous dynamical system $\dot{X} = f(X)$. From
the discussion in the previous section, the following picture emerges with regard to the
prediction of $x(T + s)$ for $0 \le s \le t_f$ given the history of the signal, which is $x(t)$ for
$0 \le t \le T$. At some point $t^*$ with $0 \le t^* \le T - t_f$, the underlying dynamical system will
have the state vector $X(t^*)$. For optimal prediction, we have to use the available signal
to form an estimate of the state vectors $X(T)$ and $X(t^*)$, as well as the decomposition
of $X(t^*) - X(T)$ into stable and unstable components relative to the splitting of the
tangent space at $X(T)$. In Section 9, we give some idea of how one might go about
doing such a thing. If two segments of the signal are similar to each other, decomposing
the distance between the two segments into stable and unstable components is somewhat
similar to spectral analysis. Indeed the Padé predictor described in Section 9 is similar
to the Wiener-Kolmogorov predictor in a few respects.

In this section, we briefly describe the Wiener-Kolmogorov predictor to help bring
out those points of similarity and to draw attention to the case of certain zero entropy
signals. Consider the dynamical system

$$\dot{\theta}_i = \omega_i, \qquad i = 1, \ldots, d, \qquad (7.1)$$

where the $\theta_i$ are angular variables and the $\omega_i$ are the corresponding frequencies. Suppose
the observed signal is $x(t) = \sum_{i=1}^{d} \cos(\theta_i)$. Predicting such a signal is chiefly a matter of
spectral analysis and the Wiener-Kolmogorov filter can handle such signals with ease. If
an optimal predictor for chaotic signals is to be well-behaved in the limit of vanishing
entropy, it too should be able to handle such signals.



The entropy theorem (Theorem 2.3) states that $t_{\mathrm{best}}/\log_2 T$ diverges as $T \to \infty$
for zero entropy signals. Therefore $t_{\mathrm{best}}$ should increase super-logarithmically. For a
signal derived from (7.1), a calculation using Kac's theorem suggests that $t_{\mathrm{best}}$ will be
proportional to $T^{1/d}$ (assuming no rational relationship between the $\omega_i$). This is because
each side of a $d$-dimensional cube must be of length $T^{-1/d}$ for the volume of the cube to
be $1/T$.

This allows us to comment on an aspect of the entropy theorem, which is the
independence of the bound on $t_{\mathrm{best}}/\log_2 T$ from the dimension. The entropy theorem leads
us to think that the recurrences of a signal obtained from a hyperbolic attractor of large
dimension and a hyperbolic attractor of small dimension will be of the same quality
if the two attractors have the same metric entropy. Can such a thing really be true?
It has to be true in the limit $T \to \infty$ as asserted by the theorem, but the problems
associated with large dimensionality will certainly be an issue for practical values of $T$.
For zero entropy signals of the type considered above, having recurrences of length $T^{1/d}$
may seem much better than recurrences of length $\log_2 T$ as guaranteed by the entropy
theorem for signals of entropy 1. However, $T$ would have to be very large for $T^{1/d}$ to be
greater than $\log_2 T$ for even $d = 20$. Since the Wiener-Kolmogorov predictor is based
purely on spectral analysis and not on tracking recurrences, it can handle signals of the
type $x(t) = \sum_{i=1}^{d} \cos(\theta_i)$ with ease for $d = 20$.
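
To see how large $T$ must be, one can write $T = 2^m$ and look for the smallest $m$ at which $T^{1/d} > \log_2 T$, that is, $m/d > \log_2 m$. The short sketch below does this by direct search; the function name `crossover_exponent` and the search limit are illustrative assumptions, and the trivial case $m = 1$ is skipped.

```python
import math

def crossover_exponent(d, m_max=10_000):
    """Smallest m >= 2 such that T**(1/d) > log2(T) when T = 2**m."""
    for m in range(2, m_max + 1):
        # Compare in log2 space: 2**(m/d) > m is the same as m/d > log2(m).
        if m / d > math.log2(m):
            return m
    return None  # no crossover found below the search limit

m20 = crossover_exponent(20)
print(f"d = 20: T**(1/d) exceeds log2(T) only once T reaches about 2**{m20}")
```

For $d = 20$ the search stops at an exponent well above one hundred, far beyond any practical length of history.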

In our brief account of the Wiener-Kolmogorov predictors, we try to bring out the
role of Toeplitz operators and matrices because Toeplitz matrices will appear later in
Section 9. Suppose $f(t)$, $-\infty < t < \infty$, is a real-valued function whose auto-correlation
$\phi(\tau)$ is given by (1.1). The Wiener-Kolmogorov predictor uses

$$\int_0^\infty f(T - s)\, dK(s)$$

to predict $f(T + \alpha)$ using $f(t)$ for $-\infty < t \le T$ and a suitable function $K(s)$ of finite
total variation. The function $K(s)$ is found to minimize the least squares error which is

$$\lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} \left| f(t + \alpha) - \int_0^\infty f(t - s)\, dK(s) \right|^2 dt.$$

The $K(s)$ which minimizes the least squares error should satisfy the Wiener-Hopf equation

$$\phi(\alpha + t) = \int_0^\infty \phi(t - s)\, dK(s).$$

By choosing appropriate function spaces, the right hand side can be interpreted as the
application of a Toeplitz operator to $K$.

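To solve for $K$, let $\Phi(\omega)$ be the Fourier transform of $\phi(t)$ and assume
$\int_{-\infty}^{\infty} \log|\Phi(\omega)| / (1 + \omega^2)\, d\omega$ to be finite. Then

$$\Psi(\mu + i\nu) = \exp\left( \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{\log|\Phi(\omega)|}{-i(\omega - (\mu + i\nu))}\, d\omega \right)$$

is free of singularities in the lower half of the complex plane and $\Phi(\omega) = |\Psi(\omega)|^2$. The
Fourier transform $k(\omega)$ of $K(s)$ is given by

$$k(\omega) = \frac{1}{2\pi\Psi(\omega)} \int_0^\infty \exp(-i\omega t)\, dt \int_{-\infty}^{\infty} \Psi(u) \exp(iu(t + \alpha))\, du.$$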



For more details, see 

In practice, all signals must be sampled at a finite rate and no filter can have infinite
memory. The theory can be modified to handle prediction of series $a_n$, $n \in \mathbb{Z}$, using
finite memory. In this case, the auto-correlation is

$$R(k) = \lim_{N \to \infty} \frac{1}{2N + 1} \sum_{l=-N}^{N} a_{l+k}\, a_l.$$

With this definition, $R(-k) = R(k)$. Let $r_n = R(n)/R(0)$. The prediction error

$$a_{l+\alpha} - \sum_{n=0}^{M} A_n\, a_{l-n}$$

is minimized in the least squares sense if

$$\sum_{n=0}^{M} A_n\, r_{k-n} = r_{k+\alpha}$$

for $k = 0, \ldots, M$. The matrix system that must be solved to calculate $A_n$ is symmetric
and Toeplitz.
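
The finite-memory predictor can be sketched in a few lines. The code below is an illustration under stated assumptions, not the procedure of [25]: it assumes the normal equations have the form $\sum_n A_n r_{k-n} = r_{k+\mathrm{lead}}$ with a lead of one sample, estimates the auto-correlation from a finite record, and solves the symmetric Toeplitz system by plain Gaussian elimination. All function names are hypothetical.

```python
import math

def autocorrelation(a, k):
    """Finite-record estimate of R(k): the average of a[l + k] * a[l]."""
    n = len(a) - k
    return sum(a[l + k] * a[l] for l in range(n)) / n

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def predictor_coefficients(a, M, lead=1):
    """Coefficients A_0..A_M so that sum_n A_n a[l - n] predicts a[l + lead]."""
    r0 = autocorrelation(a, 0)
    r = [autocorrelation(a, k) / r0 for k in range(M + lead + 1)]
    # Symmetric Toeplitz matrix: entry (k, n) depends only on |k - n|.
    toeplitz = [[r[abs(k - n)] for n in range(M + 1)] for k in range(M + 1)]
    rhs = [r[k + lead] for k in range(M + 1)]
    return solve_linear(toeplitz, rhs)

# A single sinusoid sampled at unit rate (zero entropy signal).
signal = [math.cos(1.0 * n) for n in range(5000)]
A = predictor_coefficients(signal, M=1)
# Predict the last sample from the two samples that precede it.
prediction = A[0] * signal[-2] + A[1] * signal[-3]
print(A, prediction, signal[-1])
```

For a pure cosine of frequency $\omega$, the exact two-term predictor is $a_{l+1} = 2\cos(\omega)\, a_l - a_{l-1}$, so the computed coefficients should come out close to $2\cos\omega$ and $-1$, up to the error of the finite-record auto-correlation estimate.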

We have described the Wiener-Kolmogorov predictor for a single noiseless signal. For
noisy signals, multiple noiseless signals, and multiple noisy signals, see [25].

8 Toral automorphisms 

Let $A$ be a $d \times d$ matrix with integer entries and $\det A = \pm 1$. The map $X_{n+1} =
A X_n \bmod 1$ is a hyperbolic toral automorphism if no eigenvalue of $A$ has unit modulus.
Here $X_n$ is a vector with $d$ entries each of which is assumed to be in the interval $[0, 1)$.
Each entry of the matrix vector product $A X_n$ is taken modulo 1 in the interval $[0, 1)$ to
get $X_{n+1}$. The space $[0, 1)^d$ is used as the coordinate space of the torus $\mathbb{T}^d$.

The class of hyperbolic toral automorphisms is a basic example in theoretical
dynamics [TB]. Such automorphisms are topologically transitive on the torus and possess
Markov partitions of arbitrarily small diameter. The physical measure is the Lebesgue
measure and the entropy is positive.

In the next section, we consider the prediction of the signal $x_0, \ldots, x_T$, where $x_n$ is the
first entry of $X_n$ for each $n$, $X_0$ is uniformly distributed on $\mathbb{T}^d$, and $X_{n+1} = A X_n \bmod 1$
for $n \ge 0$. The first toral automorphism that is considered is

Xn+i = AX„ mod 1, ^ = ( 1 J ) • (8-1) 

This matrix A has eigenvalues 2.61803 and 0.381966 and its entropy is log2 2.61803 = 
1.3885. The second toral automorphism that is considered is 



Xn+i = AXn mod I, A= 1 -2 1 | . (8.2 

This matrix $A$ has eigenvalues 2.1479 and $-0.57395 \pm 0.368989i$. In both instances,
$\det A = 1$.
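
A signal of this kind is easy to generate. The sketch below uses an assumed example matrix with integer entries, determinant 1, and trace 3, which reproduces the eigenvalues 2.61803 and 0.381966 quoted above; the entries are an illustrative choice (the Arnold cat map) and are not claimed to be those of (8.1). The orbit is computed exactly on rational points $X = p/q$ so that round-off is not amplified by the unstable direction; the modulus `q` and the starting point are arbitrary.

```python
import math

# Assumed example matrix (Arnold cat map): integer entries, det = 1, trace = 3.
A = [[2, 1], [1, 1]]

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix with real spectrum, from trace and determinant."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def iterate(m, p, q, steps):
    """Iterate X -> A X mod 1 exactly on the rational point X = p / q.

    Only the integer numerators are stored; reducing them modulo q is the
    same as reducing p / q modulo 1.
    """
    d = len(p)
    orbit = [tuple(p)]
    for _ in range(steps):
        p = [sum(m[i][j] * p[j] for j in range(d)) % q for i in range(d)]
        orbit.append(tuple(p))
    return orbit

lam_u, lam_s = eigenvalues_2x2(A)
hyperbolic = abs(lam_u) != 1 and abs(lam_s) != 1
entropy = math.log2(lam_u)            # bits per iterate

q = 10**9 + 7
orbit = iterate(A, [123456789, 987654321], q, 10)
signal = [p[0] / q for p in orbit]    # observed signal: first entry of X_n
print(lam_u, lam_s, entropy, signal[:3])
```

The entropy computed this way is the base 2 logarithm of the single unstable eigenvalue, in agreement with the value 1.3885 stated for the first automorphism.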

The optimal prediction scheme given in the next section is specialized to hyperbolic
toral automorphisms. The purpose of that prediction scheme is to show how the distance
between close recurrences can be decomposed into stable and unstable components.
Restricting ourselves to toral automorphisms has the advantage that the tangent space
splits into stable and unstable directions in exactly the same way at every point on
the torus (the stable and unstable vectors in (6.1) do not depend upon $c$). In the next section,
we show how signals obtained from hyperbolic toral automorphisms can be predicted
optimally in the sense defined by (4.2).






However, it is important to note that restricting ourselves to the class of hyperbolic
toral automorphisms means that some oddities occur that would not occur with a general
purpose optimal predictor. Hyperbolic toral automorphisms of dimension $d$ are defined
using finitely many parameters each of which is an integer (entries of the matrix $A$).
One may exploit that fact and tweak the predictor in the next section to reconstruct
the toral automorphism exactly. We do not overly specialize the prediction scheme in
that way. The purpose of the prediction scheme is to show what kind of considerations
may arise in the derivation of a general purpose predictor and the exact reconstruction
of the toral automorphism from time series data is irrelevant in that regard.

Hyperbolic toral automorphisms are maps. Although the theory has fewer
technicalities for maps, continuous signals may be better targets for general purpose predictors.
Typically maps arise as suspensions of flows or as maps between Poincaré sections of
flows. A lot of information contained in the path taken by the flow is lost when a map
is derived from a flow.

9 Optimal prediction of toral automorphisms 

We begin by considering the so-called exponential extrapolation problem. Suppose a
sequence is defined by

$$s_n = \sum_{k=1}^{d} c_k \lambda_k^n, \qquad n = 0, 1, \ldots \qquad (9.1)$$

The problem is to find $s_{2d}$, $s_{2d+1}$, and so on given $s_0, \ldots, s_{2d-1}$. Since the sequence
is defined by $d$ parameters $c_k$ and $d$ parameters $\lambda_k$, it is reasonable to expect that
the first $2d$ numbers of the sequence may determine the rest of the sequence. The
exponential extrapolation problem is to determine the rest of the sequence. It was
solved by Prony late in the 18th century (see [IB] for a discussion of Prony's method).
We present a solution based on Padé approximants. Our presentation could be new.
Padé approximants generalize naturally to vector Padé approximants, which may turn
out to be useful in deriving a general purpose predictor. For an introduction to Padé
approximation, see [2].



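Define $f(z) = \sum_{k=0}^{\infty} s_k z^k$. Using (9.1), we get

$$f(z) = \sum_{k=1}^{d} \frac{c_k}{1 - \lambda_k z} = \frac{a_0 + a_1 z + \cdots + a_{d-1} z^{d-1}}{1 + b_1 z + \cdots + b_d z^d}.$$

The right hand side is the $(d-1, d)$ Padé approximant of $f(z)$. Determining the $b_i$ is
the key to exponential extrapolation. We have

$$a_0 + \cdots + a_{d-1} z^{d-1} = (1 + b_1 z + \cdots + b_d z^d) \sum_{k=0}^{\infty} s_k z^k
= \sum_{k=0}^{\infty} \left( s_k + \sum_{j=1}^{\min(d,k)} b_j s_{k-j} \right) z^k.$$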

Table 5: Length of best fit from the past, suboptimal prediction using the method of
embedding, and optimal Padé prediction of a signal obtained from the automorphism
(8.1) of the two dimensional torus $\mathbb{T}^2$.



Equating coefficients of $z^k$ for $k = d, \ldots, 2d-1$, we get the $d$ equations

$$\sum_{j=1}^{d} b_j s_{k-j} = -s_k. \qquad (9.2)$$

This Toeplitz system must be solved to determine $b_j$. Its solvability is a necessary
condition for exponential extrapolation. Once the $b_j$ are determined, (9.2) is used with
$k = 2d, 2d+1, \ldots$ to determine $s_{2d}$, $s_{2d+1}$, and so on.
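
The procedure can be sketched as follows, assuming that (9.2) has the form $\sum_{j=1}^{d} b_j s_{k-j} = -s_k$. The sketch uses exact rational arithmetic so that the worked example has no round-off; the function names are hypothetical, and the test sequence $s_n = 2^n + 3^n$ is chosen only for illustration.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Solve A x = b exactly over the rationals by Gauss-Jordan elimination."""
    n = len(b)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("Toeplitz system is singular; extrapolation fails")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def denominator_coefficients(s, d):
    """Solve sum_{j=1..d} b_j s[k-j] = -s[k], k = d..2d-1; returns [b_1, ..., b_d]."""
    # Entry (k, j) is s[k - j], so the matrix is Toeplitz (constant along diagonals).
    toeplitz = [[s[k - j] for j in range(1, d + 1)] for k in range(d, 2 * d)]
    rhs = [-s[k] for k in range(d, 2 * d)]
    return solve_exact(toeplitz, rhs)

def extrapolate(s, d, count):
    """Extend s[0..2d-1] by `count` terms using s_k = -sum_j b_j s_{k-j}."""
    b = denominator_coefficients(s, d)
    out = [Fraction(v) for v in s[:2 * d]]
    for _ in range(count):
        k = len(out)
        out.append(-sum(b[j - 1] * out[k - j] for j in range(1, d + 1)))
    return b, out

# First 2d = 4 terms of s_n = 2**n + 3**n, so d = 2.
b, seq = extrapolate([2, 5, 13, 35], d=2, count=3)
print(b, seq)
```

In this example the denominator $1 + b_1 z + b_2 z^2$ factors as $(1 - 2z)(1 - 3z)$, so the $b_j$ encode the exponents $\lambda_k$ without the $\lambda_k$ ever being computed explicitly.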

The analogy of this process to the Wiener-Kolmogorov predictor is unmistakable. In
both, a Toeplitz system must be solved. Once the Toeplitz system is solved, new
numbers in the sequence are obtained as fixed linear combinations of preceding numbers
in the sequence. Indeed, it is quite possible that there may be a way to view the
Wiener-Kolmogorov predictors as variations or extensions of Prony's method as presented here.
The Toeplitz system that comes up in exponential extrapolation is unsymmetric in
general, while the Toeplitz system that comes up in the Wiener-Kolmogorov predictor
is symmetric.

Let $x_0, \ldots, x_T$ be a signal obtained from a hyperbolic toral automorphism as
explained in the previous section. Suppose we want to compare the segment

$$x_{t^*-2d+1}, \ldots, x_{t^*-1}, x_{t^*}$$

with the segment

$$x_{T-2d+1}, \ldots, x_{T-1}, x_T.$$

We first form the differences $\Delta x_i = x_{t^*-2d+1+i} - x_{T-2d+1+i}$ for $i = 0, \ldots, 2d-1$. Our
intention is to extrapolate the $\Delta x_i$ sequence to figure out how well $x_{t^*+s}$ will predict
$x_{T+s}$. Since the toral automorphisms are carried out modulo 1, we begin by making
the following modification to the $\Delta x_i$ sequence. For each $i$ with $0 \le i \le 2d-1$, if
$\Delta x_i > 1/2$, we replace $\Delta x_i$ by $\Delta x_i - 1$. On the other hand, if $\Delta x_i < -1/2$, we replace
$\Delta x_i$ by $\Delta x_i + 1$. After these operations, we will have $|\Delta x_i| \le 1/2$ for $i = 0, \ldots, 2d-1$.
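
The wrapping step is small enough to state in code. This is a minimal sketch; the function names and the short sample signal are illustrative assumptions.

```python
def wrap_difference(dx):
    """Map a difference of two numbers in [0, 1) to its representative in [-1/2, 1/2]."""
    if dx > 0.5:
        return dx - 1.0
    if dx < -0.5:
        return dx + 1.0
    return dx

def segment_differences(x, t_star, T, d):
    """Wrapped differences x[t* - 2d + 1 + i] - x[T - 2d + 1 + i] for i = 0..2d-1."""
    return [wrap_difference(x[t_star - 2 * d + 1 + i] - x[T - 2 * d + 1 + i])
            for i in range(2 * d)]

# Made-up signal values in [0, 1); with d = 1 each segment has two entries.
x = [0.02, 0.97, 0.50, 0.98, 0.03, 0.51]
diffs = segment_differences(x, t_star=1, T=4, d=1)
print(diffs)
```

In the sample, the raw differences are close to $-1$ and $+1$ even though the two points are near each other on the circle; wrapping recovers the small signed distances.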

If the point on the torus $\mathbb{T}^d$ that corresponds to $x_n$ is $X_n$, we have $X_{n+1} - X_{m+1} =
A(X_n - X_m) \bmod 1$. Therefore if $X_{t^*-2d+1} - X_{T-2d+1}$ is small enough, the sequence $\Delta x_i$,
$i = 0, \ldots, 2d-1$, can be written as a linear combination of exponentials like the $s_n$
sequence in (9.1). The $\lambda_k$ will be the eigenvalues of $A$. We use a tolerance to check if
the $\Delta x_i$ are small enough to permit sensible exponential extrapolation.

Using exponential extrapolation, we compute $\Delta x_{2d}$, $\Delta x_{2d+1}$, and so on, and find the
maximum $n$ such that each of the numbers

$$|\Delta x_{2d}|, \ldots, |\Delta x_{2d+n-1}|$$

is less than tol. For the computations reported in this section, tol = 0.1. The $n$ found
in this way is the expected length of fit. The $t^*$ which gives the maximum expected
length of fit is chosen. The sequences $x_{t^*+1}, x_{t^*+2}, \ldots$ and $x_{T+1}, x_{T+2}, \ldots$ are compared
to determine the actual length of fit, which is denoted by $t_{\mathrm{pade}}$.
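
The computation of the expected length of fit for one candidate $t^*$ might look as follows. This is a sketch under assumptions: the differences are taken to be already wrapped into $[-1/2, 1/2]$, the same tolerance is used both to accept the initial differences and to terminate the extrapolation (the text does not say whether the two tolerances coincide), and the cap `max_steps` is added only to guarantee termination. The synthetic differences in the example are built from the eigenvalues $(3 \pm \sqrt{5})/2$ with a tiny unstable component and a much larger stable component.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def expected_length_of_fit(dx, d, tol=0.1, max_steps=200):
    """Count how many extrapolated differences stay below tol in magnitude."""
    if any(abs(v) >= tol for v in dx):
        return 0  # segments too far apart for sensible extrapolation
    # Toeplitz system sum_j b_j dx[k-j] = -dx[k] for k = d..2d-1.
    toeplitz = [[dx[k - j] for j in range(1, d + 1)] for k in range(d, 2 * d)]
    rhs = [-dx[k] for k in range(d, 2 * d)]
    b = solve_linear(toeplitz, rhs)
    seq = list(dx)
    n = 0
    while n < max_steps:
        k = len(seq)
        nxt = -sum(b[j - 1] * seq[k - j] for j in range(1, d + 1))
        if abs(nxt) >= tol:
            break
        seq.append(nxt)
        n += 1
    return n

# Synthetic differences: unstable component 1e-6, stable component 0.05.
lam_u = (3 + math.sqrt(5)) / 2
lam_s = (3 - math.sqrt(5)) / 2
dx = [1e-6 * lam_u**i + 0.05 * lam_s**i for i in range(4)]
n_fit = expected_length_of_fit(dx, d=2)
print(n_fit)
```

Note that the initial differences in the example are about 0.05, which is not small, yet the fit is long because the unstable component is tiny; a candidate with initial differences of $10^{-3}$ lying mostly along the unstable direction would fit for fewer iterates.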

In Table 5 we list $t_{\mathrm{best}}$ (the best fit from the past defined as in (1.2)), $t_{\mathrm{embed}}$, and
$t_{\mathrm{pade}}$. For the embedding predictor, we took $2d$ to be the embedding dimension. By
going down the table, we can easily detect that the entropy $H$ is greater than 1. It is
evident that the embedding predictor falls well short of being optimal, while the Padé
predictor, which resolves the distance between segments of the signal into stable and
unstable components, approaches optimality.

Both the embedding predictor and the Padé predictor implicitly assume the
dimension of the torus to be known, which is not true if we are given the signal $x_0, \ldots, x_T$
explicitly and nothing more. The dimension can be calculated using one of the methods
mentioned in the next section or it can be found by trial and error by checking the
effectiveness of the Padé predictor. Such details are not terribly relevant here as our purpose
is simply to show that the distance between segments of a signal can be decomposed
into components in a restricted setting and that such a decomposition leads to optimal
prediction.

From Figure 4 we see that the best fit from the past does not agree too well with the
signal at $T-1$, $T-2$, and so on. However, it suddenly slams into the signal starting at
$T$ and closely tracks the signal for more than 12 iterates. The embedding predictor on
the other hand does too good a job of fitting the past, but tracks only 5 iterates from
$T$ onwards. The Padé predictor produces a match that requires a few iterates in the
past to be close enough for exponential extrapolation. Except for that, it reproduces
the behavior of the best fit where the signal segment that is chosen from the history of
the signal slams into the signal at $t = T$ and then tracks it for a number of iterates.



Table 6 and Figure 5 refer to the toral automorphism defined by (8.2). By going down
Table 6 and comparing it with Table 5, we notice that the automorphism of $\mathbb{T}^3$ has lower
entropy than the automorphism of $\mathbb{T}^2$. The tendency of the embedding predictor to fit
into the past is very pronounced in the middle plot of Figure 5.

Figure 4: In each of the three plots, the black dots are part of a signal obtained from
the iterates of the automorphism (8.1) of the two dimensional torus $\mathbb{T}^2$. Here $T = 2^{21}$.
The bigger red dots are: (a) the best fit from the past; (b) suboptimal prediction using
the embedding method; (c) optimal prediction using the Padé method.

Table 6: Length of best fit from the past, suboptimal prediction using the method of
embedding, and optimal Padé prediction of a signal obtained from the automorphism
(8.2) of the three dimensional torus $\mathbb{T}^3$.

Figure 5: In each of the three plots, the black dots are part of a signal obtained from
the iterates of the automorphism (8.2) of the three dimensional torus $\mathbb{T}^3$. Here T = 2^^.
The bigger red dots are: (a) the best fit from the past; (b) suboptimal prediction using
the embedding method; (c) optimal prediction using the Padé method.



The figures and tables of this section give a good sense of how much is lost when a 
predictor fails to account for the unstable components of the distance between segments 
of the signal. They also suggest that a predictor which subjects the signal to more 
delicate analysis should be able to approach optimality. 

10 Time series embedding and prediction 

Current predictors of chaotic signals use delay coordinates. It has even been stated 
that "for prediction problems past based coordinates are unavoidable" [B]. While it is a 
tautology to say that a predictor must be based on information from the past, we are 
not convinced that delay coordinates are the best or the only way to extract information 
from the history of a signal. 

The idea behind delay coordinates is mainly topological in nature. Consider a
complete graph $K_n$ with $n$ vertices and with $n \ge 5$. If such a graph is drawn on the plane,
two edges must intersect as a consequence of the Jordan curve theorem. On the other
hand, if each vertex is assigned to a point in $\mathbb{R}^3$ that is "random" or in general position,
no two edges will intersect. Two typical lines in $\mathbb{R}^3$ do not intersect. Any topological
manifold of dimension $m$ can be embedded in $\mathbb{R}^{2m+1}$ using points in general position
and partitions of unity [S].

The theoretical justification of attractor reconstruction using delay coordinates is
due to Takens [2T]. The key idea is that delay coordinates generically give points in
general position. At several points the proof of the embedding theorem of Takens is
similar to the proof of the classical embedding theorem by Hurewicz [13]. Fixed points
and short periodic orbits require special treatment.

Delay coordinates are used by methods that infer attractor dimension from time
series data [HI [HI 112]. Here we note that dimension is a topological quantity.

With regard to prediction of signals or time series, purely topological information
will not be sufficient. The embedding predictors attempt to estimate the distance
between points in phase space using delay coordinates. A more systematic study of metric
information may be useful.

Suppose that $x(t)$, $t \ge 0$, is a chaotic signal. Let the points in phase space that
correspond to $t = t_1$, $t = t_2$, and $t = t_3$ be $A$, $B$, and $C$, respectively. Suppose that the
signal is known for $0 \le t \le T$ and that $t_1$, $t_2$, $t_3$ are all less than $T$. We ask the following
question: how accurately can the angles of the triangle $ABC$ be estimated using the
time series data? This is an example of the type of question which tries to get at metric
information rather than the topological kind.

11 Conclusion 

The central point of this paper is that a good predictor of chaotic signals must not simply 
try to find a pattern of events that is as close as possible to the pattern of events leading 
up to the current time. The distance between the two patterns of events must be resolved 
into stable and unstable components. The magnitudes of the unstable components must 
be small and delicately balanced for optimal prediction. The stable components on the 
other hand are typically as large as the tolerance for correct prediction permits. 

This conclusion has a counter-intuitive consequence. Because the stable components 
are typically not small, the known pattern of events which is best suited for predicting 
the current pattern of events will not resemble the current pattern particularly closely. 

We have made this point using theoretical arguments and examples whose simplicity 
made them suitable for illustration. We certainly feel that this point is of very general 
applicability. Wherever signals or patterns need to be predicted using a database of 
previously recorded patterns, this point is very likely to be of much importance. This 
should be true whether the signal is noisy or not, although the type of validity and
manner of prediction of noisy signals will differ. This should be true for temporal patterns
as well as spatial patterns, although we have not treated spatial patterns explicitly. 

Suppose we follow a storm system which crosses from Canada to Minnesota and 
then to Wisconsin and makes landfall at Michigan after crossing Lake Michigan. If we 
want to predict the path of the storm after it makes landfall, it is reasonable to look for 
earlier storms which followed a similar path across Lake Michigan. The arguments and 
computations of this paper suggest strongly that such pattern matching is not the right 
idea. Instead, we have to look for an earlier storm whose eastward path across Lake 
Michigan may in fact look different. The key property is that the unstable components 
that separate the two paths must be small so that the paths of the two storms are 
converging. One has to analyze the unstable components to determine for how long 
the two paths are likely to remain close after landfall. Although the method for such
analysis is unknown, we have offered a few hints.